pith. machine review for the scientific record.

arxiv: 2603.20133 · v2 · submitted 2026-03-20 · 💻 cs.CL

Recognition: no theorem link

Reasoning Gets Harder for LLMs Inside A Dialogue

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 08:07 UTC · model grok-4.3

classification 💻 cs.CL
keywords LLM reasoning · task-oriented dialogue · multi-turn evaluation · benchmark construction · performance gap · arithmetic reasoning · spatial reasoning · temporal reasoning

The pith

LLMs show a substantial and consistent drop in reasoning performance when tasks are embedded inside multi-turn dialogues rather than presented in isolation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that current reasoning benchmarks, which test models on isolated problems, overestimate LLM capabilities in realistic task-oriented dialogue settings, where reasoning must occur alongside text generation and strict adherence to role, format, and style instructions. To demonstrate this, the authors introduce the BOULDER benchmark, which presents eight travel-related tasks involving arithmetic, spatial, and temporal reasoning in matched isolated and dialogue-based formats. Experiments across eight LLMs reveal a clear performance gap that ablations attribute primarily to the multi-turn structure of dialogue, with secondary contributions from role conditioning and tool-use demands. This matters for anyone relying on benchmarks to predict how models will behave in interactive applications such as travel assistants or customer-service agents.
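The gap described above is, at bottom, a paired comparison: the same problem instances scored once in each framing, aggregated per model. A minimal sketch of that bookkeeping (model names and accuracy values below are invented for illustration, not taken from the paper):

```python
# Hypothetical per-model accuracies on the same problem instances,
# scored once in the isolated framing and once inside a dialogue.
results = {
    "model-a": {"isolated": 0.82, "dialogue": 0.61},
    "model-b": {"isolated": 0.74, "dialogue": 0.58},
}

# The "performance gap" is the drop from isolated to dialogue framing.
gaps = {m: r["isolated"] - r["dialogue"] for m, r in results.items()}
for model, gap in sorted(gaps.items(), key=lambda kv: -kv[1]):
    print(f"{model}: gap = {gap:+.2f}")
```

Because both conditions share the exact same instances, any per-instance difficulty variation cancels out of the comparison, which is what makes the paired design informative.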

Core claim

When the same reasoning problems are placed inside task-oriented dialogues instead of given as standalone questions, LLMs exhibit a substantial and consistent performance decline; controlled ablations indicate the multi-turn nature of dialogue is the dominant factor, with additional contributions from role instructions and tool-use requirements.

What carries the argument

BOULDER, a dynamic benchmark that supplies matched isolated and dialogue variants for eight travel tasks requiring arithmetic, spatial, temporal, commonsense, and formal reasoning.
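BOULDER's pairing idea — one rule-generated problem rendered both as a standalone question and as a turn inside a travel dialogue — can be sketched roughly as follows. The task, templates, value ranges, and message schema here are invented for illustration; they are not the benchmark's actual generation code:

```python
import random

def make_ticket_problem(rng: random.Random) -> dict:
    """Rule-based instance generation: sample values, compute the gold answer."""
    price = rng.randint(10, 40)   # hypothetical ticket price in pounds
    tickets = rng.randint(2, 6)   # hypothetical number of travellers
    return {"price": price, "tickets": tickets, "answer": price * tickets}

def render_isolated(p: dict) -> str:
    """Standalone-question framing of the instance."""
    return (f"A train ticket costs £{p['price']}. "
            f"What is the total for {p['tickets']} tickets?")

def render_dialogue(p: dict) -> list[dict]:
    """Same instance, embedded in a multi-turn travel-assistant exchange."""
    return [
        {"role": "system", "content": "You are a helpful travel assistant."},
        {"role": "user", "content": "Hi, I need a train to Cambridge on Friday."},
        {"role": "assistant", "content": f"Sure — tickets are £{p['price']} each."},
        {"role": "user", "content": f"Great, book {p['tickets']} tickets. "
                                    "How much will that be in total?"},
    ]

rng = random.Random(0)
problem = make_ticket_problem(rng)
isolated_prompt = render_isolated(problem)
dialogue_messages = render_dialogue(problem)
```

Because instances are sampled fresh from value ranges rather than drawn from a fixed question pool, the benchmark is "dynamic": new examples can be generated at will, which is also what mitigates data contamination.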

If this is right

  • Isolated-benchmark scores cannot be treated as reliable proxies for reasoning ability in live interactive systems.
  • Model evaluation protocols should routinely include multi-turn dialogue variants to surface hidden weaknesses.
  • Training objectives may need explicit exposure to interleaved reasoning and instruction-following within conversations.
  • Tool-augmented dialogue agents will likely underperform on reasoning steps unless the dialogue context itself is part of the training signal.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Developers of dialogue systems may need new fine-tuning regimes that simulate full conversation histories rather than single prompts.
  • The gap could widen further in longer sessions or with more complex role constraints, suggesting a scaling law for dialogue depth.
  • Existing safety and alignment techniques tuned on isolated prompts may transfer poorly once models must maintain persona across turns.

Load-bearing premise

The observed performance gap stems mainly from the multi-turn interactive format rather than from benchmark artifacts, differing levels of data contamination, or variations in prompt length and structure.

What would settle it

Re-running the eight tasks with dialogue variants that are forced to single-turn responses while exactly matching token length, role instructions, and output format constraints, then finding no remaining performance gap.
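The decisive ablation resembles the paper's Single-turn dialogue setup: merge all user queries into one message while leaving the system prompt, and with it the role, format, and style instructions, untouched. A rough sketch of such a collapse, assuming a standard chat-message schema (not the authors' code):

```python
def collapse_to_single_turn(messages: list[dict]) -> list[dict]:
    """Merge every user turn into a single user message, preserving the
    system prompt so role/format/style instructions stay identical."""
    system = [m for m in messages if m["role"] == "system"]
    user_parts = [m["content"] for m in messages if m["role"] == "user"]
    merged = {"role": "user", "content": " ".join(user_parts)}
    return system + [merged]

dialogue = [
    {"role": "system", "content": "You are a travel assistant."},
    {"role": "user", "content": "Which trains leave after 20:00?"},
    {"role": "assistant", "content": "There are 12 trains after 20:00."},
    {"role": "user", "content": "How frequent are they on average?"},
]
single = collapse_to_single_turn(dialogue)
```

To fully settle the question, one would additionally pad or trim so that the collapsed prompt matches the multi-turn variant in token length, which this sketch does not attempt.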

Figures

Figures reproduced from arXiv:2603.20133 by Ivan Kartáč, Mateusz Lango, Ondřej Dušek.

Figure 1: An example from the BOULDER benchmark, showing the same problem instance in two variants: as an isolated task and within a task-oriented dialogue.
Figure 2: Evaluation results of 8 LLMs in the three …
Figure 5: Results for the Baseline with dialogue role and Dialogue with reasoning instruction ablations, micro-averaged over all tasks.
Figure 6: Detailed per-task results for the Baseline (b), Dialogue (d), and Dialogue-concise (c) setups. Asterisks indicate significant differences between setups in neighboring columns (t-test, ∗: p < 0.05, ∗∗: p < 0.01, ∗∗∗: p < 0.001).
Figure 7: Example conversation template in JSON format for the …
Figure 8: Prompt template used for generating paraphrases of conversation templates.
Figure 9: Prompt template for the task-oriented dialogue system used in the …
Figure 10: Prompt template for the task-oriented dialogue system used in the …
Figure 11: Changes made to the prompt for the Dialogue with reduced domains ablation. Text highlighted in red indicates removed content; the example covers the distance and opening hours tasks, where the attractions and trains domains are removed.
Figure 12: Changes made to the prompt and the message history for the …
Figure 13: Changes made to the prompt and the message history for the …
Figure 14: Changes made to the prompt and the message history for the …
Figure 15: Changes made to the prompt and the message history for the …
Figure 16: Changes made to the prompt and the message history for the …
Figure 17: Prompt template for the amounts parser used to extract values in the …
Figure 18: Prompt template for the time parser used to extract values in the …
Figure 19: Prompt template for the time parser used to extract values in the …
Figure 20: Prompt template for the restaurant names parser used to extract values in the …
Figure 21: Prompt template for the time parser used to extract values in the …
Figure 22: Prompt template for the time parser used to extract values in the …
Figure 23: Prompt template for the time parser used to extract values in the …
Figure 24: Detailed results for the Dialogue with reduced domains (d(d)) ablation, compared with the Baseline (b) and Dialogue (d) setups.
Figure 25: Detailed results for the Dialogue without tools (d(t)) ablation, compared with the Baseline (b) and Dialogue (d) setups.
Figure 26: Detailed results for the Multi-turn baseline (b(m)) ablation, compared with the Baseline (b) and Dialogue (d) setups.
Figure 27: Detailed results for the Single-turn dialogue (d(s)) ablation, compared with the Baseline (b) and Dialogue (d) setups.
Figure 28: Detailed results for the Baseline with dialogue role (b(r)) ablation, compared with the Baseline (b) and Dialogue (d) setups.
Figure 29: Detailed results for the Dialogue with reasoning instructions (d(r)) ablation, compared with the Baseline (b) and Dialogue (d) setups.
Figure 30: Detailed results for the parser evaluation by task, model, and evaluation setting.
Figure 31: Scores and average response lengths in characters by LLM averaged over all tasks.
Figure 32: Comparison of scores and average response lengths in characters by LLM for the …
Figure 33: Comparison of scores and average response lengths in characters by LLM for the …
Figure 34: Comparison of responses from Qwen3 30B A3B for the …
Figure 35: Comparison of responses from Command A 111B for the …
Figure 36: Comparison of responses from Qwen3 235B A22B for the …
Figure 37: Comparison of responses from Claude 4.5 Sonnet for the …
Original abstract

Large Language Models (LLMs) achieve strong performance on many reasoning benchmarks, yet these evaluations typically focus on isolated tasks that differ from real-world usage in task-oriented dialogue (TOD). In this setting, LLMs must perform reasoning inherently while generating text and adhering to instructions on role, format, and style. This mismatch raises concerns about whether benchmark performance accurately reflects models' reasoning robustness in TOD setting. We investigate how framing reasoning tasks within TOD affects LLM performance by introducing BOULDER, a new dynamic benchmark covering eight travel-related tasks that require arithmetic, spatial, and temporal reasoning with both commonsense and formal aspects. Each problem is presented in both isolated and dialogue-based variants, enabling controlled comparison while mitigating data contamination. Experiments on eight LLMs reveal a substantial and consistent performance gap between isolated and dialogue settings. Through ablations and qualitative analysis, we show that this gap is largely driven by the multi-turn nature of dialogue, with additional effects from role conditioning and tool-use requirements. Our results highlight the need to evaluate LLM reasoning in realistic interactive scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces BOULDER, a dynamic benchmark of eight travel-related tasks requiring arithmetic, spatial, and temporal reasoning. Each task is instantiated in matched isolated and dialogue-based variants; experiments on eight LLMs report a consistent performance drop in the dialogue setting, which ablations and qualitative analysis attribute primarily to multi-turn interaction, with secondary contributions from role conditioning and tool-use requirements.

Significance. If the controlled comparison holds, the work demonstrates that standard isolated reasoning benchmarks can overestimate LLM capabilities relative to realistic task-oriented dialogue, providing a concrete motivation for interactive evaluation protocols. The benchmark design that re-uses the same underlying problems across framings is a methodological strength that reduces contamination concerns.

major comments (2)
  1. §4 (Results) and §5 (Ablations): the central claim that the gap is 'largely driven by the multi-turn nature' requires explicit confirmation that isolated baselines receive identical role, format, and style instructions on every turn; without a side-by-side prompt template comparison, the ablation results risk conflating turn count with differences in instruction complexity and output constraints.
  2. Abstract and §4: the reported 'substantial and consistent performance gap' is stated without numerical values, per-task accuracies, error bars, or exact problem counts; the results tables must include these quantities together with statistical tests to allow readers to assess effect size and variability across the eight models.
minor comments (2)
  1. §3 (Benchmark construction): specify the exact number of problems generated per task and the precise mechanism used to ensure the isolated and dialogue variants remain semantically identical while varying only the interaction framing.
  2. Figure 2 and Table 1: axis labels and legend entries should explicitly distinguish 'isolated' from 'dialogue' conditions and indicate whether error bars represent standard deviation across models or across problem instances.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for improving clarity in our experimental design and results presentation. We address each point below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: §4 (Results) and §5 (Ablations): the central claim that the gap is 'largely driven by the multi-turn nature' requires explicit confirmation that isolated baselines receive identical role, format, and style instructions on every turn; without a side-by-side prompt template comparison, the ablation results risk conflating turn count with differences in instruction complexity and output constraints.

    Authors: We agree that explicit confirmation is necessary to strengthen the claim. In the isolated setting, each problem receives the identical role, format, and style instructions as the first turn of the corresponding dialogue variant (with no subsequent turns). To eliminate any ambiguity and allow readers to verify that instruction complexity does not confound the multi-turn factor, we will add side-by-side prompt templates for both conditions in the revised appendix. This addition will directly support the ablation results attributing the gap primarily to multi-turn interaction. revision: yes

  2. Referee: Abstract and §4: the reported 'substantial and consistent performance gap' is stated without numerical values, per-task accuracies, error bars, or exact problem counts; the results tables must include these quantities together with statistical tests to allow readers to assess effect size and variability across the eight models.

    Authors: We will revise §4 to expand the results tables with per-task accuracies, exact problem counts per task and model, error bars (standard deviation across multiple runs), and statistical significance tests (paired t-tests with p-values) for the isolated vs. dialogue gaps. These details will enable assessment of effect sizes and variability. The abstract will remain a high-level summary, but the main text will now contain all requested quantitative information. revision: yes

Circularity Check

0 steps flagged

No circularity: direct empirical comparison on newly introduced paired benchmark variants

Full rationale

The paper introduces the BOULDER benchmark with explicitly paired isolated and dialogue-based variants of the same travel-related reasoning problems. Performance gaps are measured through controlled experiments on eight LLMs, supported by ablations that vary role conditioning, tool-use, and turn count. No equations, predictions, or first-principles derivations are present that reduce to their own inputs by construction. The central claim rests on observable experimental differences rather than self-referential definitions, fitted parameters renamed as predictions, or load-bearing self-citations. The derivation chain is self-contained and externally falsifiable via replication on the released benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the construction of the BOULDER benchmark and the empirical comparison; no free parameters, axioms, or invented entities are described in the abstract.

pith-pipeline@v0.9.0 · 5487 in / 1025 out tokens · 40203 ms · 2026-05-15T08:07:36.638798+00:00 · methodology

