Found in Conversation: LLMs Teach Themselves to Close the Multi-Turn Gap

Jure Leskovec; Shirley Wu; Tianlang Chen

arxiv: 2605.24432 · v1 · pith:4X6OZKMEnew · submitted 2026-05-23 · 💻 cs.CL

Found in Conversation: LLMs Teach Themselves to Close the Multi-Turn Gap

Tianlang Chen , Shirley Wu , Jure Leskovec This is my paper

Pith reviewed 2026-06-30 13:31 UTC · model grok-4.3

classification 💻 cs.CL

keywords LLMmulti-turn conversationself-distillationLost-in-Conversationself-teachingView-Asymmetric Self-Distillationconversational AI

0 comments

The pith

LLMs can teach themselves to recover single-turn performance in multi-turn conversations via self-distillation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models lose capability when information arrives across multiple conversational turns instead of all at once, a gap called Lost-in-Conversation. The paper introduces Found in Conversation, a method in which each model generates its own single-turn answers and then uses those answers to train itself on the corresponding multi-turn versions of the same queries. View-Asymmetric Self-Distillation transfers the stronger single-turn behavior into the weaker multi-turn setting without any external teacher model. Experiments across Llama, Qwen, Phi, and OLMo families show recovery of at least 92 percent of single-turn performance and full recovery on two Llama backbones. A reader cares because the approach makes natural, underspecified dialogue more reliable while leaving single-turn skills untouched.

Core claim

By treating the single-turn view of each task as teacher and the multi-turn view as student, View-Asymmetric Self-Distillation lets a model distill its own strong single-turn responses into improved multi-turn behavior, recovering at least 92 percent of single-turn performance across Llama, Qwen, Phi, and OLMo models sized 3B to 14B and reaching 100 percent on two Llama backbones.

What carries the argument

View-Asymmetric Self-Distillation, which distills from the model's single-turn responses as teacher signal into its multi-turn responses as student.

If this is right

Multi-turn conversations become more efficient and helpful while single-turn performance stays unchanged.
No external stronger model is required, since the method relies only on the target model's own single-turn outputs.
The same recovery holds across four model families and a range of sizes from 3B to 14B parameters.
Self-generated data can be used directly for this distillation step without external supervision.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach may reduce reliance on human-written multi-turn training data for conversational agents.
Similar view-asymmetric distillation could be tested on other underspecified tasks such as long-context reasoning or tool use.
Iterating the process multiple times might produce further gains or reveal limits of self-generated teacher signals.

Load-bearing premise

The single-turn view supplies an unbiased teacher signal for the multi-turn student without degrading other capabilities or injecting artifacts from self-generated data.

What would settle it

A post-training evaluation showing clear drops on held-out single-turn tasks or measurable increases in hallucination rates traceable to the self-distilled data would falsify the claim that capabilities remain intact.

Figures

Figures reproduced from arXiv: 2605.24432 by Jure Leskovec, Shirley Wu, Tianlang Chen.

**Figure 1.** Figure 1: Overview of FOUND IN CONVERSATION. SFT warm-starts the model: build a multi-turn corpus from standard single-turn benchmark, construct per-turn gold answers with diversity, and train with a standard causal-LM loss. VASD runs the same backbone on two information-equivalent views: a Teacher view on single-turn setting and a Student view on the multi-turn setting with intermediate responses sampled on-poliy. … view at source ↗

**Figure 2.** Figure 2: Three information-equivalent views of the same instruction. [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Qualitative comparison of LLAMA-3.1-8B baseline (left) and FIC (right) on a multi-turn GSM8K problem. The baseline anchors on hallucinated assumptions in early turns and propagates them to a wrong final answer. FIC defers at intermediate turns, requesting specific missing information, and produces the correct answer once information is complete. from earlier turns and propagates them through to a wrong fi… view at source ↗

read the original abstract

Large Language Model (LLM) interactions are typically underspecified, with users clarifying all necessary details across multiple conversational turns. Yet recent work shows that LLMs perform far worse in this multi-turn setting than in a single turn with same information being available at once, a phenomenon termed "Lost-in-Conversation." However, bridging this gap effectively remains an open problem. Here we introduce Found in Conversation (FiC), a training framework where a model teaches itself to find and recover its single-turn competence given underspecified multi-turn prompts. We develop View-Asymmetric Self-Distillation, which distills across two views of the same task information--single-turn view for the teacher, multi-turn view for the student--transferring strong single-turn behavior into weak multi-turn behavior. This requires no stronger external teacher, which is unavailable as even frontier LLMs exhibit this gap. Across model families (Llama, Qwen, Phi, and OLMo) and sizes (3B-14B), FiC recovers at least 92% of single-turn performance and reaches 100% on two Llama backbones, yielding more efficient and helpful multi-turn conversations with single-turn capabilities intact.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

FiC shows a workable self-distillation fix that recovers 92%+ of single-turn performance in multi-turn LLM use across four model families.

read the letter

The main takeaway is that this paper gives a self-contained way for models to close most of the multi-turn gap by distilling from a single-turn teacher view to a multi-turn student view on the same task.

The actual contribution is the view-asymmetric setup in Found in Conversation. The teacher sees the complete information in one turn while the student sees the underspecified multi-turn version, and the transfer happens without calling on a stronger external model. That matches the setting, since the gap shows up even in large models. The tests run on Llama, Qwen, Phi, and OLMo from 3B to 14B and report at least 92% recovery, with full recovery on two Llama backbones. The coverage across families is a clear plus and makes the result less likely to be an artifact of one architecture.

The method stays non-circular by anchoring the teacher on the stronger single-turn signal. The reported numbers line up with the goal of keeping single-turn capability while lifting multi-turn behavior.

The main soft spot is the risk that self-generated multi-turn data could slowly shift the distribution in ways the single-turn teacher does not catch. The abstract gives no evidence of degradation or capability drift, so the concern may not materialize in their runs, but it would be useful to see the data-generation details and any checks on unrelated tasks. That is a standard thing to verify rather than a fatal gap.

This is for people who fine-tune or deploy conversational LLMs and need a practical lever for multi-turn performance. A reader who cares about training techniques that do not require frontier teachers will get direct use from the framework and the cross-model numbers. It has a clear method, reasonable scale of testing, and a real deployment problem behind it, so it deserves peer review.

Referee Report

2 major / 1 minor

Summary. The paper introduces Found in Conversation (FiC), a self-distillation training framework called View-Asymmetric Self-Distillation in which an LLM uses its own single-turn view of a task as teacher to improve its multi-turn (underspecified) view as student. The goal is to recover single-turn competence in multi-turn settings without external teachers. Across Llama, Qwen, Phi, and OLMo families (3B–14B), the method is reported to recover ≥92% of single-turn performance and 100% on two Llama backbones.

Significance. If the reported recovery rates are supported by rigorous experiments, the work would be significant for conversational LLM applications: it offers a scalable, self-contained way to mitigate the documented multi-turn performance drop without relying on stronger external models. The view-asymmetric distillation approach is a concrete contribution that could improve efficiency and helpfulness in real multi-turn interactions while preserving other capabilities.

major comments (2)

Abstract and Methods: the central empirical claim (≥92% recovery, 100% on two Llama backbones) is load-bearing, yet the abstract supplies no information on task definitions, how single-turn vs. multi-turn views are constructed from the same underlying information, number of turns, datasets, or evaluation metrics. Without these details it is impossible to verify that the single-turn teacher signal remains unbiased for the multi-turn student.
[Results] Results section: the cross-family and cross-size consistency is asserted, but no baselines, variance estimates, number of runs, or ablations on capability drift / other tasks are referenced. This leaves the claim that single-turn capabilities remain intact unsupported and prevents assessment of whether the 92% figure reflects genuine transfer or evaluation artifacts.

minor comments (1)

The abstract states that FiC yields 'more efficient and helpful multi-turn conversations' but does not define or report separate metrics for efficiency or helpfulness beyond the recovery percentage.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below and commit to revisions that improve clarity and rigor without altering the core claims.

read point-by-point responses

Referee: [—] Abstract and Methods: the central empirical claim (≥92% recovery, 100% on two Llama backbones) is load-bearing, yet the abstract supplies no information on task definitions, how single-turn vs. multi-turn views are constructed from the same underlying information, number of turns, datasets, or evaluation metrics. Without these details it is impossible to verify that the single-turn teacher signal remains unbiased for the multi-turn student.

Authors: We agree that the abstract should include sufficient detail for readers to understand the experimental setup and verify the unbiased nature of the teacher signal. In the revised version we will expand the abstract to specify task definitions, how single-turn and multi-turn views are derived from identical underlying information, the number of turns, the datasets, and the evaluation metrics, while ensuring the methods section explicitly describes the view-asymmetric construction that keeps the single-turn teacher unbiased. revision: yes
Referee: [Results] Results section: the cross-family and cross-size consistency is asserted, but no baselines, variance estimates, number of runs, or ablations on capability drift / other tasks are referenced. This leaves the claim that single-turn capabilities remain intact unsupported and prevents assessment of whether the 92% figure reflects genuine transfer or evaluation artifacts.

Authors: We acknowledge that the results section would benefit from explicit reporting of these elements to substantiate the claims. The revised manuscript will add relevant baselines, variance estimates, the number of runs, and ablations on capability drift and other tasks to demonstrate that single-turn performance remains intact and that the reported recovery rates reflect genuine transfer rather than artifacts. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical self-distillation framework is self-contained

full rationale

The paper introduces View-Asymmetric Self-Distillation as an empirical training procedure that uses single-turn task views as the teacher signal to improve multi-turn student behavior, with no stronger external model required. Reported outcomes (≥92% recovery of single-turn performance across Llama/Qwen/Phi/OLMo families, 100% on two Llama backbones) are framed as measured experimental results from applying the method rather than quantities algebraically forced by the method's own definitions or fitted parameters. The abstract and described framework contain no equations, no self-citations invoked as load-bearing uniqueness theorems, no ansatzes smuggled via prior work, and no renaming of known patterns as new derivations. The central claim therefore rests on the falsifiable empirical hypothesis that the single-turn teacher signal transfers without introducing artifacts, which is tested by the reported recovery rates and does not reduce to a definitional loop.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no information on free parameters, axioms, or invented entities; ledger left empty.

pith-pipeline@v0.9.1-grok · 5743 in / 1052 out tokens · 46701 ms · 2026-06-30T13:31:00.344797+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

6 extracted references · 1 canonical work pages · 1 internal anchor

[1]

Constitutional AI: Harmlessness from AI Feedback

URLhttps://openreview.net/forum?id=CrzAj0kZjR. Anthropic. Introducing Claude, 2023. URL https://www.anthropic.com/index/ introducing-claude/. Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback.arXiv pre...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.64434/tml.20251026 2023
[2]

contain on the order of 100 instructions per task, which limits the resolution of recovery-rate estimates. We mitigate this by averaging n= 10 simulations per instance (amounting to roughly 1,000 scored conversations per model-condition cell) and by reporting consistent trends across five backbones and three out-of-domain tasks, but a larger, more diverse...

2025
[3]

let’s assume

Do NOT assume, fabricate, or guess any values not explicitly stated by the user. If a quantity is unknown, state that it is unknown -- never substitute a hypothetical value or use phrases like "let’s assume" or "for example, $200 each"
[4]

If any value is still missing, withhold computation and instead state what information is still needed

Do NOT compute a final numerical answer until ALL quantities required for the calculation have been explicitly provided by the user. If any value is still missing, withhold computation and instead state what information is still needed
[5]

I still need the following information to solve this: [list]

When information is incomplete, respond with "I still need the following information to solve this: [list]" rather than filling in gaps yourself
[6]

Do not hesitate or ask for confirmation -- if the math can be done with the given values, do it

HOWEVER, as soon as you have enough explicitly stated values to perform the calculation, you MUST compute and present the final numerical answer immediately. Do not hesitate or ask for confirmation -- if the math can be done with the given values, do it. PROMPT-SELFCHECK. You are solving a mathematical problem where the user will provide information gradu...

[1] [1]

Constitutional AI: Harmlessness from AI Feedback

URLhttps://openreview.net/forum?id=CrzAj0kZjR. Anthropic. Introducing Claude, 2023. URL https://www.anthropic.com/index/ introducing-claude/. Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback.arXiv pre...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.64434/tml.20251026 2023

[2] [2]

contain on the order of 100 instructions per task, which limits the resolution of recovery-rate estimates. We mitigate this by averaging n= 10 simulations per instance (amounting to roughly 1,000 scored conversations per model-condition cell) and by reporting consistent trends across five backbones and three out-of-domain tasks, but a larger, more diverse...

2025

[3] [3]

let’s assume

Do NOT assume, fabricate, or guess any values not explicitly stated by the user. If a quantity is unknown, state that it is unknown -- never substitute a hypothetical value or use phrases like "let’s assume" or "for example, $200 each"

[4] [4]

If any value is still missing, withhold computation and instead state what information is still needed

Do NOT compute a final numerical answer until ALL quantities required for the calculation have been explicitly provided by the user. If any value is still missing, withhold computation and instead state what information is still needed

[5] [5]

I still need the following information to solve this: [list]

When information is incomplete, respond with "I still need the following information to solve this: [list]" rather than filling in gaps yourself

[6] [6]

Do not hesitate or ask for confirmation -- if the math can be done with the given values, do it

HOWEVER, as soon as you have enough explicitly stated values to perform the calculation, you MUST compute and present the final numerical answer immediately. Do not hesitate or ask for confirmation -- if the math can be done with the given values, do it. PROMPT-SELFCHECK. You are solving a mathematical problem where the user will provide information gradu...