Same Evidence, Different Answers: Canonical-Context On-Policy Distillation for Multi-Turn Language Models

Chao Zhang; Jingtao Xu; Jinsheng Quan; Quanling Liu; Xing Shi; Yawei Luo; Yifan Zhu; Zhihui Li; Zizhuo Lin

arxiv: 2605.30251 · v1 · pith:MSMKKHDLnew · submitted 2026-05-28 · 💻 cs.CL · cs.AI

Same Evidence, Different Answers: Canonical-Context On-Policy Distillation for Multi-Turn Language Models

Zizhuo Lin , Quanling Liu , Jinsheng Quan , Chao Zhang , Yifan Zhu , Xing Shi , Jingtao Xu , Zhihui Li

show 1 more author

Yawei Luo

This is my paper

Pith reviewed 2026-06-29 07:24 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords multi-turn language modelson-policy distillationcontext driftcanonical contextself-anchored assumptionszero-shot generalizationmath reasoning

0 comments

The pith

Training a model to match its full-prompt answers while generating its own multi-turn trajectories reduces drift from partial evidence and improves consistency across tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models often reach different final answers when the same complete evidence arrives all at once versus piece by piece across conversation turns. The paper attributes this gap to self-anchored drift, where responses made under partial information introduce unsupported assumptions that later distort the outcome. To address it, the authors introduce Canonical-Context On-Policy Distillation: the same base model serves as a frozen teacher that sees the full evidence in one prompt and as a trainable student that receives the evidence incrementally; the student is aligned to the teacher's canonical answers on trajectories it generates itself. When trained only on math-problem conversations, this produces a 32 percent average relative gain on sharded-evidence versions of math and five other task families while largely keeping full-context performance intact. The approach therefore targets a practical failure mode in conversational use, where information arrives gradually rather than in a single clean prompt.

Core claim

When the same complete user evidence is presented either as a clean full prompt or as a raw-sharded multi-turn conversation, models should produce identical answers; the gap arises because partial responses introduce unsupported assumptions that later distort the final answer. Canonical-Context On-Policy Distillation corrects this by using the base model in two roles: a frozen teacher conditioned on the full prompt supplies the target behavior, while the trainable student receives the identical evidence turn by turn and is trained on-policy to match the teacher's outputs on its own generated trajectories. This yields the reported 32 percent average relative improvement on raw-sharded perform

What carries the argument

Canonical-Context On-Policy Distillation (CCOPD), in which the identical base model acts as frozen teacher on the clean full prompt and as trainable student on incremental conversations, aligning the student's behavior on its own trajectories to the teacher's canonical full-context answers.

If this is right

RAW-SHARDED performance improves by 32 percent relative on average across math and five zero-shot out-of-domain families.
Full-context performance is largely preserved after training.
Grounding in user-provided evidence is strengthened and sensitivity to contamination from prior assistant turns is reduced.
The improvement transfers to tasks outside the math-only training distribution.
The same base model can serve simultaneously as teacher and student without external supervision.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could be applied to other incremental-information settings such as long-running dialogues or step-by-step reasoning chains where early assumptions must not propagate unchecked.
If the teacher's full-context answers contain systematic errors, the distillation process would embed those errors into the student's multi-turn behavior.
Extending the training distribution beyond math problems might further widen the set of tasks that benefit from reduced context drift.

Load-bearing premise

The frozen teacher's answers on the clean full prompt are the correct canonical target, and aligning the student to those answers on its own generated trajectories transfers the desired behavior without introducing new drift or distribution shift.

What would settle it

A controlled experiment in which the distilled student is evaluated on full-prompt inputs and produces answers that systematically diverge from the frozen teacher, or in which raw-sharded accuracy shows no gain or a loss relative to the base model.

Figures

Figures reproduced from arXiv: 2605.30251 by Chao Zhang, Jingtao Xu, Jinsheng Quan, Quanling Liu, Xing Shi, Yawei Luo, Yifan Zhu, Zhihui Li, Zizhuo Lin.

**Figure 1.** Figure 1: Part 1: Task-equivalent FULL, CONCAT, and RAW-SHARDED presentations. Part 2: Reduced selfanchored drift and improved canonical-context consistency. succeed when a task is presented as a clean FULL prompt can still fail when the same information is disclosed incrementally through a RAW-SHARDED conversation (Laban et al., 2025; He et al., 2024; Li et al., 2025b). Part 1 of [PITH_FULL_IMAGE:figures/full_fi… view at source ↗

**Figure 2.** Figure 2: Qwen3-8b-base model probes for selfanchored drift. (a) Masking process-reply commitment spans selectively improves gold-vs-anchor margins for wrong-anchor states. (b) Replacing process replies with neutral placeholders lowers predictive-state deviation from the canonical FULL reference. answers from the realized RAW-SHARDED history, while a frozen copy of the same base model scores the student’s own answe… view at source ↗

**Figure 3.** Figure 3: Overview method of CCOPD. CCOPD aligns a RAW-SHARDED-conditioned student with a frozen FULL-conditioned teacher using answer-masked same-prefix reverse KL. reached after the student conditions on the realized transcript, including any earlier assistant-side commitments contained in h. For each t ∈ Tans(ˆy), the student and teacher score the same prefix yˆ<t under different contexts: p h θ,t = πθ(· | h, yˆ… view at source ↗

**Figure 4.** Figure 4: Mechanism diagnostics for self-anchored drift. (a) [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

read the original abstract

Large language models (LLMs) often solve a task when all instructions are given in a single prompt, but fail when the same information is revealed gradually across turns. When a clean FULL prompt and a RAW-SHARDED conversation contain the same complete user evidence, the model should still arrive at the same answer. We argue that a key reason for this gap is self-anchored drift: responses produced under partial information introduce unsupported assumptions, and those assumptions later distort the final answer. To reduce this effect, we propose Canonical-Context On-Policy Distillation (CCOPD). During training, the same base model is used in two roles: a frozen teacher conditioned on the clean FULL prompt and a trainable student that receives the same evidence incrementally through a multi-turn conversation; CCOPD aligns the student's behavior on its own trajectories with the teacher's canonical full-context behavior. Trained only on math problem conversations, CCOPD yields a 32\% average relative improvement in RAW-SHARDED performance over the original base model across math and five zero-shot out-of-domain task families, while largely preserving full-context performance. Further analyses suggest that CCOPD strengthens grounding in user evidence and reduces sensitivity to contamination from earlier assistant turns.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CCOPD applies on-policy distillation from a full-context teacher to cut self-anchored drift in sharded multi-turn settings, with reported 32% relative gains that transfer out of domain, but the abstract gives no check on whether the teacher itself is accurate.

read the letter

The punchline is that this paper trains a student on its own sharded trajectories to match a frozen teacher's outputs on the complete prompt, and claims this lifts RAW-SHARDED performance 32% relative across math and five zero-shot tasks while keeping full-context results mostly intact.

The new element is the on-policy loop that lets the student generate its own partial-information conversations and then aligns them to the canonical full-context behavior. That setup is a direct, practical extension of distillation ideas to the incremental-evidence case, and the out-of-domain transfer suggests it may be teaching better grounding rather than task-specific patterns.

The soft spot is the unexamined assumption that the teacher's full-context answers are the right target. If the base model still produces incorrect or assumption-laden answers even with complete evidence, the student simply learns to copy those errors. The abstract supplies no per-example teacher accuracy numbers against ground truth and no ablation on cases where the teacher is wrong, so it is unclear whether the reported gain reflects genuine improvement or just better imitation of the model's own biases. The lack of metric definitions, baseline details, and significance tests in the abstract also makes the 32% figure hard to assess.

This is for people working on multi-turn LLM robustness and distillation methods. A reader who wants concrete training ideas for reducing drift from partial information would get value from the approach.

It deserves peer review because the core procedure is simple, the claimed gains are large, and the problem it targets is real, even though the current write-up needs more experimental transparency to be convincing.

Referee Report

3 major / 0 minor

Summary. The manuscript proposes Canonical-Context On-Policy Distillation (CCOPD) to address self-anchored drift in LLMs, where models solve tasks with full prompts but fail when the same evidence is revealed incrementally across conversation turns. CCOPD uses the base model as both a frozen teacher (conditioned on the clean FULL prompt) and a trainable student (receiving the evidence sharded across turns); the student is aligned on-policy to the teacher's canonical outputs on its own generated trajectories. Trained only on math problem conversations, the method is reported to yield a 32% average relative improvement in RAW-SHARDED performance over the base model across math and five zero-shot out-of-domain task families while largely preserving full-context performance.

Significance. If the empirical claims hold under detailed scrutiny, the result would be moderately significant for multi-turn LLM consistency: it offers a training-time intervention that improves sharded-evidence robustness without architectural changes or new data sources, and the reported zero-shot transfer from math-only training is noteworthy. The approach directly targets a practical failure mode (drift from partial information) and could inform distillation techniques more broadly, though its impact depends on whether the gains are robust to variations in base models and task distributions.

major comments (3)

[Abstract] Abstract: the central claim of a 32% average relative improvement in RAW-SHARDED performance is presented without any information on the concrete metrics, the precise definition or computation of RAW-SHARDED, the baselines, the number of evaluation examples per task family, statistical significance testing, or data exclusion criteria. These omissions make it impossible to assess whether the reported gain supports the method's effectiveness.
[Method / Experiments (implied by abstract description)] The core assumption that the frozen teacher's full-context answers constitute reliable canonical targets is load-bearing for the claimed benefit. If the base model produces incorrect or assumption-laden answers even when given the complete prompt (a known issue in math reasoning and zero-shot tasks), on-policy distillation will propagate those errors rather than correct drift. The manuscript provides no per-example analysis comparing teacher outputs to ground truth, nor an ablation isolating cases where the teacher itself errs.
[Experiments / Results (implied by abstract)] The generalization claim (math-only training yields gains on five out-of-domain families) is central yet unsupported by any reported controls for task similarity, prompt formatting differences, or whether the out-of-domain tasks were truly zero-shot with respect to the training distribution.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below, providing clarifications from the manuscript and committing to revisions where details or analyses are missing.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim of a 32% average relative improvement in RAW-SHARDED performance is presented without any information on the concrete metrics, the precise definition or computation of RAW-SHARDED, the baselines, the number of evaluation examples per task family, statistical significance testing, or data exclusion criteria. These omissions make it impossible to assess whether the reported gain supports the method's effectiveness.

Authors: We agree the abstract is too condensed. RAW-SHARDED denotes accuracy on multi-turn conversations where the full evidence is provided incrementally (vs. the single full prompt). The 32% figure is the mean relative gain in task-specific metrics (exact match for math, accuracy for others) vs. the base model baseline. Evaluation details, example counts, significance tests, and exclusion criteria appear in Sections 3-4. We will revise the abstract to define RAW-SHARDED and reference the experimental protocol. revision: yes
Referee: The core assumption that the frozen teacher's full-context answers constitute reliable canonical targets is load-bearing for the claimed benefit. If the base model produces incorrect or assumption-laden answers even when given the complete prompt (a known issue in math reasoning and zero-shot tasks), on-policy distillation will propagate those errors rather than correct drift. The manuscript provides no per-example analysis comparing teacher outputs to ground truth, nor an ablation isolating cases where the teacher itself errs.

Authors: This concern is valid. The approach assumes the full-context teacher yields higher-quality targets than the sharded student. We report that full-context performance is largely preserved after distillation, which indirectly suggests limited degradation. However, the manuscript lacks explicit per-example teacher-vs-ground-truth comparisons or ablations on teacher-error cases. We will add these analyses in the revision, reporting teacher accuracy rates and the impact of distilling from erroneous teacher outputs. revision: yes
Referee: The generalization claim (math-only training yields gains on five out-of-domain families) is central yet unsupported by any reported controls for task similarity, prompt formatting differences, or whether the out-of-domain tasks were truly zero-shot with respect to the training distribution.

Authors: We agree additional controls would strengthen the claim. Training used only math conversations; the five out-of-domain families involve distinct reasoning types with no training overlap and uniform sharding/prompt formatting. We will expand the revision with task-selection criteria, any available similarity metrics between math and out-of-domain distributions, and explicit confirmation of zero-shot status. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical training procedure with no derivation chain or self-referential reductions

full rationale

The paper describes CCOPD as a practical distillation procedure: a frozen teacher (base model on full prompt) provides targets, and the student is trained on-policy to match those targets on sharded trajectories. No equations, fitted parameters, uniqueness theorems, or ansatzes are introduced that reduce to the inputs by construction. Performance numbers (e.g., 32% relative gain) are reported from experiments rather than derived; the central claim does not rely on self-citation load-bearing or renaming of known results. The method is self-contained as an empirical intervention whose validity is tested externally via held-out tasks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The approach implicitly assumes that full-context teacher outputs are the desired target and that on-policy trajectories are sufficient for transfer.

pith-pipeline@v0.9.1-grok · 5777 in / 1147 out tokens · 20226 ms · 2026-06-29T07:24:40.115651+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

8 extracted references · 6 canonical work pages · 5 internal anchors

[1]

Evaluating Large Language Models Trained on Code

Evaluating large language models trained on code.Preprint, arXiv:2107.03374. Qinyuan Cheng, Tianxiang Sun, Xiangyang Liu, Wen- wei Zhang, Zhangyue Yin, Shimin Li, Linyang Li, Zhengfu He, Kai Chen, and Xipeng Qiu. 2024. Can AI assistants know what they don’t know? InPro- ceedings of the 41st International Conference on Ma- chine Learning, volume 235 ofProc...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[2]

Training Verifiers to Solve Math Word Problems

Training verifiers to solve math word prob- lems.arXiv preprint arXiv:2110.14168. Jeremy Cole, Michael Zhang, Daniel Gillick, Julian Eisenschlos, Bhuwan Dhingra, and Jacob Eisenstein

work page internal anchor Pith review Pith/arXiv arXiv
[3]

From Explicit CoT to Implicit CoT: Learning to Internalize CoT Step by Step

Selectively answering ambiguous questions. InProceedings of the 2023 Conference on Empiri- cal Methods in Natural Language Processing, pages 530–543, Singapore. Association for Computational Linguistics. Yuntian Deng, Yejin Choi, and Stuart Shieber. 2024. From explicit cot to implicit cot: Learning to internalize cot step by step.arXiv preprint arXiv:2405...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[4]

Clarifymt-bench: Benchmarking and improv- ing multi-turn clarification for conversational large language models.Preprint, arXiv:2512.21120. Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Wellec...

work page arXiv 2023
[5]

Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation

Large language models are better reasoners with self-verification. InFindings of the Associa- tion for Computational Linguistics: EMNLP 2023, pages 2550–2575, Singapore. Association for Com- putational Linguistics. Shirley Wu, Michel Galley, Baolin Peng, Hao Cheng, Gavin Li, Yao Dou, Weixin Cai, James Zou, Jure Leskovec, and Jianfeng Gao. 2025. CollabLLM:...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[6]

OPSDL: On-Policy Self-Distillation for Long-Context Language Models

Modeling future conversation turns to teach LLMs to ask clarifying questions. InInternational Conference on Learning Representations. Tong Zhang, Peixin Qin, Yang Deng, Chen Huang, Wen- qiang Lei, Junhong Liu, Dingnan Jin, Hongru Liang, and Tat-Seng Chua. 2024b. CLAMBER: A bench- mark of identifying and clarifying ambiguous infor- mation needs in large la...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[7]

(12) Proof

=E y∼P h θ   τ(y)X t=1 dt(y)   ,(11) where dt(y) =D KL(πθ(· |h, y <t)∥π0(· |c, y <t)). (12) Proof. For any terminal answer string y= (y1, . . . , yτ(y) )∈ A , the two autoregressive dis- tributions factor as P h θ (y) = τ(y)Y t=1 πθ(yt |h, y <t),(13) P c 0(y) = τ(y)Y t=1 π0(yt |c, y <t).(14) Substituting these factorizations into sequence- 11 Model Va...
[8]

verified

= X y∈A P h θ (y) log P h θ (y) P c 0(y) =E y∼P h θ   τ(y)X t=1 log πθ(yt |h, y <t) π0(yt |c, y <t)   . (15) Conditioning on a realized prefix y<t under P h θ , the next token yt is distributed as πθ(· |h, y <t). Therefore, E log πθ(yt |h, y <t) π0(yt |c, y <t) y<t =D KL(πθ(· |h, y <t)∥π0(· |c, y <t)). (16) Applying Equation (16) inside Equation (15) ...

2018

[1] [1]

Evaluating Large Language Models Trained on Code

Evaluating large language models trained on code.Preprint, arXiv:2107.03374. Qinyuan Cheng, Tianxiang Sun, Xiangyang Liu, Wen- wei Zhang, Zhangyue Yin, Shimin Li, Linyang Li, Zhengfu He, Kai Chen, and Xipeng Qiu. 2024. Can AI assistants know what they don’t know? InPro- ceedings of the 41st International Conference on Ma- chine Learning, volume 235 ofProc...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[2] [2]

Training Verifiers to Solve Math Word Problems

Training verifiers to solve math word prob- lems.arXiv preprint arXiv:2110.14168. Jeremy Cole, Michael Zhang, Daniel Gillick, Julian Eisenschlos, Bhuwan Dhingra, and Jacob Eisenstein

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

From Explicit CoT to Implicit CoT: Learning to Internalize CoT Step by Step

Selectively answering ambiguous questions. InProceedings of the 2023 Conference on Empiri- cal Methods in Natural Language Processing, pages 530–543, Singapore. Association for Computational Linguistics. Yuntian Deng, Yejin Choi, and Stuart Shieber. 2024. From explicit cot to implicit cot: Learning to internalize cot step by step.arXiv preprint arXiv:2405...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[4] [4]

Clarifymt-bench: Benchmarking and improv- ing multi-turn clarification for conversational large language models.Preprint, arXiv:2512.21120. Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Wellec...

work page arXiv 2023

[5] [5]

Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation

Large language models are better reasoners with self-verification. InFindings of the Associa- tion for Computational Linguistics: EMNLP 2023, pages 2550–2575, Singapore. Association for Com- putational Linguistics. Shirley Wu, Michel Galley, Baolin Peng, Hao Cheng, Gavin Li, Yao Dou, Weixin Cai, James Zou, Jure Leskovec, and Jianfeng Gao. 2025. CollabLLM:...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[6] [6]

OPSDL: On-Policy Self-Distillation for Long-Context Language Models

Modeling future conversation turns to teach LLMs to ask clarifying questions. InInternational Conference on Learning Representations. Tong Zhang, Peixin Qin, Yang Deng, Chen Huang, Wen- qiang Lei, Junhong Liu, Dingnan Jin, Hongru Liang, and Tat-Seng Chua. 2024b. CLAMBER: A bench- mark of identifying and clarifying ambiguous infor- mation needs in large la...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[7] [7]

(12) Proof

=E y∼P h θ   τ(y)X t=1 dt(y)   ,(11) where dt(y) =D KL(πθ(· |h, y <t)∥π0(· |c, y <t)). (12) Proof. For any terminal answer string y= (y1, . . . , yτ(y) )∈ A , the two autoregressive dis- tributions factor as P h θ (y) = τ(y)Y t=1 πθ(yt |h, y <t),(13) P c 0(y) = τ(y)Y t=1 π0(yt |c, y <t).(14) Substituting these factorizations into sequence- 11 Model Va...

[8] [8]

verified

= X y∈A P h θ (y) log P h θ (y) P c 0(y) =E y∼P h θ   τ(y)X t=1 log πθ(yt |h, y <t) π0(yt |c, y <t)   . (15) Conditioning on a realized prefix y<t under P h θ , the next token yt is distributed as πθ(· |h, y <t). Therefore, E log πθ(yt |h, y <t) π0(yt |c, y <t) y<t =D KL(πθ(· |h, y <t)∥π0(· |c, y <t)). (16) Applying Equation (16) inside Equation (15) ...

2018