arxiv: 2604.24005 · v3 · submitted 2026-04-27 · 💻 cs.LG · cs.AI

Recognition: unknown

TCOD: Exploring Temporal Curriculum in On-Policy Distillation for Multi-turn Autonomous Agents

Jiaqi Wang , Wenhao Zhang , Weijie Shi , Yaliang Li , James Cheng

Authors on Pith no claims yet

Pith reviewed 2026-05-08 04:36 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords on-policy distillationmulti-turn agentstemporal curriculumKL divergenceerror compoundingtrajectory depthautonomous agentsALFWorld

0 comments

The pith

A temporal curriculum that gradually increases trajectory length stabilizes on-policy distillation for multi-turn agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard on-policy distillation breaks down in multi-turn agent settings because early errors push the student into regions where the teacher's guidance becomes unreliable, causing KL divergence to rise and success rates to fall. The proposed fix introduces a schedule that begins training on short trajectories and steadily lengthens them, keeping the student inside the teacher's effective support for longer. This change produces more stable KL values across training, raises final performance by as much as 18 points on standard benchmarks, and in some cases lets the student exceed the teacher's own results while succeeding on tasks the teacher cannot solve.

Core claim

Vanilla on-policy distillation exhibits Trajectory-Level KL Instability in multi-turn environments because inter-turn error compounding drives the student outside the teacher's reliable distribution; the temporal curriculum that starts with short trajectories and progressively expands their depth directly counters this compounding, yielding lower and more stable KL throughout training together with higher agent success rates.

What carries the argument

The temporal curriculum schedule that controls and expands the maximum trajectory depth presented to the student during distillation.

If this is right

KL divergence stays lower and more stable for the entire training run instead of escalating.
Final agent success rates rise substantially compared with standard on-policy distillation.
The distilled student can outperform the original teacher on some tasks.
The student succeeds on instances where the teacher itself fails.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same progressive-length schedule could be tested with other distillation or reinforcement objectives that suffer from compounding error.
Curriculum control of horizon length may become a standard ingredient when distilling long-horizon agents rather than an optional add-on.
If the mechanism is error compounding, similar instability should appear in any multi-step imitation or RL setting that uses on-policy sampling without depth limits.

Load-bearing premise

The observed KL escalation is caused mainly by inter-turn error compounding and that lengthening trajectories via curriculum will fix it without introducing new instabilities.

What would settle it

Running the same student-teacher pairs on the same benchmarks with the curriculum schedule but still observing rising KL divergence and no gain in success rate.

Figures

Figures reproduced from arXiv: 2604.24005 by James Cheng, Jiaqi Wang, Weijie Shi, Wenhao Zhang, Yaliang Li.

**Figure 1.** Figure 1: (left) In OPD for multi-turn agents, as the number of turns increases, the teacher assigns progressively lower probabilities to tokens in student-generated responses, indicating increasing KL divergence at each turn, rendering the supervision signal unreliable. (right) OPD uses all turns and thus includes compounding errors, whereas TCOD-F2B/B2F progressively expands from short to long trajectories, allev… view at source ↗

**Figure 2.** Figure 2: Trajectory-level KL analysis across different teacher–student pairs on ALFWorld. (a)(b) show that the KL divergence escalates throughout training and task completion rates collapse. (c) shows the large gap between the initial and converged KL divergence during OPD training. (d) reveals the underlying reason: the KL divergence grows with the turn index, indicating compounding error amplification over the tr… view at source ↗

**Figure 3.** Figure 3: Overview of our method TCOD-F2B/B2F. Comparison of vanilla on-policy distillation and TCOD. Left is the OPD, middle is the illustration of TCOD-F2B, and right is TCOD-B2F. k is the linear pacing control the trajectory length. The blue step is executed by the student, and the red step is executed by the teacher with a stop gradient. variants: TCOD-F2B and TCOD-B2F, which explicitly impose step constraints i… view at source ↗

**Figure 4.** Figure 4: Training Dynamics comparison of TCOD and OPD on ALFWorld. (a) and (b) show the success rate and KL divergence, respectively, for Qwen2.5-7B as the student. TCOD maintains a higher success rate and more stable KL divergence. (c) and (d) show the success rate and KL divergence, respectively, for Qwen2.5-1.5B as the student model. TCOD-F2B under η = 3, 6 mitigate the success rate collapse and kl escalation. F… view at source ↗

**Figure 5.** Figure 5: Further Analysis of TCOD-F2B/B2F on ALFWorld. (a)(b) is the average action rounds, advantages during training for Qwen2.5-7B as the student. TCOD effectively reduces the action rounds and achieves faster advantage convergence. (c)(d) is the maximum response length, policy gradient loss during training for Qwen2.5-1.5B as the student model. TCOD mitigates redundant responses while maintaining training stabi… view at source ↗

**Figure 6.** Figure 6: Training time comparison. TCOD is computationally efficient view at source ↗

**Figure 7.** Figure 7: KL Escalation and success rate across Teacher–Student Pairs. We evaluate Qwen3-{0.6B, 1.7B} (teacher: Qwen3-30B-A3B-Instruct) and Qwen2.5-{0.5B, 1.5B} (teacher: Qwen2.5-7B-RL) under vanilla OPD on ALFWorld. 0 25 50 75 100 125 150 175 200 Training Steps 0 200 400 600 800 1000 1200 Rollout Kl Divergence Mean rollout / kl_divergence / mean S:3b_T:7b_rl S:7b_T:7b_rl S:3b_T:30b (a) KL divergence. 0 25 50 75 100… view at source ↗

**Figure 8.** Figure 8: Horizon-Induced KL Escalation across Teacher–Student Pairs. We evaluate Qwen2.5-{3B, 7B} (teacher: Qwen3-30B-A3B-Instruct, Qwen2.5-7B-RL) under vanilla OPD on ALFWorld. in Equation 4. By concentrating the distillation signal on early-turn states at the beginning of training and gradually extending the horizon, the student builds a robust foundation before being exposed to the full trajectory, effectively m… view at source ↗

**Figure 9.** Figure 9: Training dynamics of TCOD-B2F (η = 2), including KL divergence, student action horizon, and success rate, for a Qwen2.5-7B student distilled from a GRPO-trained Qwen2.5-7B teacher on ALFWorld view at source ↗

**Figure 10.** Figure 10: Success rates of TCOD-B2F (η = 2), including train hard (left), valid unseen(middle), and valid seen(right), for a Qwen2.5-7B student distilled from a GRPO-trained Qwen2.5-7B teacher on ALFWorld. ALFWorld Task Prompt Template You are an expert agent operating in the ALFRED Embodied Environment. Your task is to: {task description} Prior to this step, you have already taken {step count} step(s). Below are … view at source ↗

read the original abstract

On-policy distillation (OPD) has shown strong potential for transferring reasoning ability from frontier or domain-specific models to smaller students. While effective on static single-turn tasks, its behavior in multi-turn agent settings remains underexplored. In this work, we identify a key limitation of vanilla OPD in such settings, which we term Trajectory-Level KL Instability. Specifically, we observe that KL divergence increases together with a drop in success rate, and even after convergence, the KL remains high, leading to unstable training. This instability arises from inter-turn error compounding: as errors accumulate, the student is driven beyond the teacher's effective support, rendering the supervision signal unreliable. To address this, we propose TCOD (Temporal Curriculum On-Policy Distillation), a simple yet effective framework that controls the trajectory depth exposed to the student and progressively expands it from short to long with a curriculum schedule. Experimental results across four student-teacher pairs on three multi-turn agent benchmarks (ALFWorld, WebShop, ScienceWorld) show that TCOD mitigates KL escalation and enhances KL stability throughout training, improving agent performance by up to 18 points over vanilla OPD. Further evaluations show that TCOD can even surpass the teacher's performance and generalize to tasks on which the teacher fails. Our code is available at https://github.com/kokolerk/TCOD.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

TCOD names a KL instability from error compounding in multi-turn OPD and counters it with a depth curriculum that delivers measurable gains on agent benchmarks.

read the letter

The paper's main point is that vanilla on-policy distillation runs into trajectory-level KL growth in multi-turn settings because mistakes compound across turns and push the student outside the teacher's reliable support. Their TCOD approach starts the student on short trajectories and gradually lengthens them according to a curriculum schedule, which keeps the supervision signal usable for longer and stabilizes training. This is a targeted, practical observation rather than a broad theoretical claim, and it fits what you see in long-horizon agent work where single-turn assumptions break down. They test it across ALFWorld, WebShop, and ScienceWorld with four different student-teacher pairs, report up to 18-point success-rate lifts over plain OPD, and note cases where the student exceeds the teacher on tasks the teacher itself struggles with. The public code link is a plus for anyone who wants to reproduce or adapt the schedule. The causal link they draw between inter-turn compounding and the observed KL rise is straightforward and matches the mechanism you'd expect. On the softer side, the curriculum expansion schedule itself is a free parameter, and the abstract gives no detail on how sensitive results are to its exact form or whether they ran full ablations on it. The 18-point figure and the surpassing-teacher claim would be easier to weigh with error bars, statistical tests, or more breakdown on which tasks drive the gains. Generalization to teacher-failure cases is interesting but needs clearer controls to separate curriculum effects from other training choices. This is useful reading for people already working on distilling agents or RL for interactive environments; it is not a foundational shift but a concrete tweak that could save time if you're hitting similar instability. I would send it to peer review because the problem is well-motivated, the benchmarks are standard, and the fix is simple enough that referees can focus on the empirical details without needing new theory.

Referee Report

2 major / 2 minor

Summary. The paper identifies Trajectory-Level KL Instability in vanilla on-policy distillation (OPD) for multi-turn autonomous agents, attributing it to inter-turn error compounding that drives the student outside the teacher's support. It proposes TCOD, which applies a temporal curriculum to progressively expand the depth of trajectories used for distillation from short to long according to a schedule. Experiments across four student-teacher pairs on ALFWorld, WebShop, and ScienceWorld benchmarks report that TCOD improves KL stability, yields up to 18-point gains over vanilla OPD, and enables the student to surpass the teacher on some tasks while generalizing to cases where the teacher fails.

Significance. If the reported gains and stability improvements hold under rigorous controls, the work supplies a practical, low-overhead technique for stabilizing on-policy distillation in sequential settings. The curriculum mechanism directly targets the identified compounding issue and could transfer to other multi-turn distillation or RL pipelines; the observation that students can exceed teachers on long trajectories is a useful empirical signal about teacher suboptimality.

major comments (2)

[§4.2, Table 2] §4.2 and Table 2: the claim that TCOD 'mitigates KL escalation' requires explicit comparison of per-turn KL trajectories (not just final values) against vanilla OPD; without these curves or a statistical test on the slope of KL growth, the causal link to the curriculum schedule remains under-supported for the central stability claim.
[§3.3] §3.3: the curriculum expansion schedule is listed as a free parameter; the paper should report sensitivity analysis over at least three different schedules (linear, exponential, task-adaptive) to show that the performance gains are not an artifact of a single tuned schedule.

minor comments (2)

The abstract states 'up to 18 points' but the main text should include per-benchmark, per-pair deltas with standard errors and the exact number of seeds used.
Figure 3 (KL stability plots) would benefit from shaded standard-error bands and a direct overlay of the vanilla OPD baseline on the same axes for immediate visual comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive evaluation of our work on TCOD for stabilizing on-policy distillation in multi-turn autonomous agents. We address each major comment below and will revise the manuscript accordingly to strengthen the evidence for our claims.

read point-by-point responses

Referee: [§4.2, Table 2] §4.2 and Table 2: the claim that TCOD 'mitigates KL escalation' requires explicit comparison of per-turn KL trajectories (not just final values) against vanilla OPD; without these curves or a statistical test on the slope of KL growth, the causal link to the curriculum schedule remains under-supported for the central stability claim.

Authors: We agree that per-turn KL trajectories and a statistical analysis of growth slopes would provide stronger support for the stability claim. The current manuscript reports final KL values in Table 2 along with success rates, but to directly address this point we will add new figures in the revised version showing the evolution of per-turn KL divergence over training steps for TCOD versus vanilla OPD on all three benchmarks. We will also include a linear regression analysis on the slopes of KL growth to quantify the reduction in escalation, thereby better linking the observed stability to the temporal curriculum mechanism. revision: yes
Referee: [§3.3] §3.3: the curriculum expansion schedule is listed as a free parameter; the paper should report sensitivity analysis over at least three different schedules (linear, exponential, task-adaptive) to show that the performance gains are not an artifact of a single tuned schedule.

Authors: We acknowledge that the expansion schedule is a hyperparameter in TCOD. While the main experiments use a linear schedule for its simplicity and progressive nature, we will incorporate sensitivity analysis in the revised manuscript. Specifically, we will evaluate and report results for linear, exponential, and task-adaptive schedules (where adaptation is based on per-task success thresholds) on the ALFWorld and WebShop benchmarks, demonstrating that the performance improvements remain consistent across these variants and are not tied to a single choice. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper identifies Trajectory-Level KL Instability as an empirical observation in vanilla OPD, attributes it to inter-turn error compounding, and introduces TCOD as a curriculum schedule that progressively expands trajectory depth. No equations, fitted parameters, or derivations are shown that reduce the claimed KL stability gains or performance improvements (up to 18 points) to a quantity defined by the method itself. The results are presented as experimental outcomes on ALFWorld, WebShop, and ScienceWorld benchmarks, with independent validation that the student can exceed the teacher on some tasks. No self-citation chains, ansatzes, or uniqueness theorems are invoked as load-bearing for the core mechanism. The derivation chain consists of observation plus empirical intervention and remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

Only the abstract is available, so the ledger is populated at a high level from the stated assumptions. The curriculum schedule is the main tunable element; the core premise that error compounding drives the instability is treated as a domain assumption.

free parameters (1)

curriculum expansion schedule
The rate and manner of increasing trajectory depth from short to long is a design choice that must be set to achieve the reported stability gains.

axioms (1)

domain assumption Inter-turn error compounding is the primary driver of Trajectory-Level KL Instability in vanilla on-policy distillation for multi-turn agents
This premise is invoked to motivate the curriculum intervention.

pith-pipeline@v0.9.0 · 5552 in / 1317 out tokens · 58658 ms · 2026-05-08T04:36:02.596349+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SOD: Step-wise On-policy Distillation for Small Language Model Agents
cs.CL 2026-05 unverdicted novelty 6.0

SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.

Reference graph

Works this paper leans on

21 extracted references · 19 canonical work pages · cited by 1 Pith paper · 8 internal anchors

[1]

Group-in-Group Policy Optimization for LLM Agent Training

GitHub repository. Lang Feng, Zhenghai Xue, Tingcong Liu, and Bo An. Group-in-group policy optimization for llm agent training.arXiv preprint arXiv:2505.10978,

work page internal anchor Pith review arXiv
[2]

Temp-R1: A Unified Autonomous Agent for Complex Temporal KGQA via Reverse Curriculum Reinforcement Learning

Zhaoyan Gong, Zhiqiang Liu, Songze Li, Xiaoke Guo, Yuanxiang Liu, Xinle Deng, Zhizhen Liu, Lei Liang, Huajun Chen, and Wen Zhang. Temp-r1: A unified autonomous agent for complex temporal kgqa via reverse curriculum reinforcement learning.arXiv preprint arXiv:2601.18296,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Stable On-Policy Distillation through Adaptive Target Reformulation

GitHub repository. Ijun Jang, Jewon Yeom, Juan Yeo, Hyunggu Lim, and Taesup Kim. Stable on-policy distilla- tion through adaptive target reformulation.arXiv preprint arXiv:2601.07155,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Entropy-aware on-policy distillation of language models

Woogyeol Jin, Taywon Min, Yongjin Yang, Swanand Ravindra Kadhe, Yi Zhou, Dennis Wei, Nathalie Baracaldo, and Kimin Lee. Entropy-aware on-policy distillation of language models.arXiv preprint arXiv:2603.07079,

work page arXiv
[5]

arXiv preprint arXiv:2603.11137 , year =

Jongwoo Ko, Sara Abdali, Young Jin Kim, Tianyi Chen, and Pashmina Cameron. Scaling reasoning efficiently via relaxed on-policy distillation.arXiv preprint arXiv:2603.11137,

work page arXiv
[6]

Imitation learning for multi-turn lm agents via on-policy expert corrections.arXiv preprint arXiv:2512.14895, 2025

Niklas Lauffer, Xiang Deng, Srivatsa Kundurthy, Brad Kenstler, and Jeff Da. Imita- tion learning for multi-turn lm agents via on-policy expert corrections.arXiv preprint arXiv:2512.14895,

work page arXiv
[7]

SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

Xiangyi Li, Wenbo Chen, Yimin Liu, Shenghan Zheng, Xiaokun Chen, Yifeng He, Yubo Li, Bingran You, Haotian Shen, Jiankai Sun, et al. Skillsbench: Benchmarking how well agent skills work across diverse tasks.arXiv preprint arXiv:2602.12670,

work page internal anchor Pith review arXiv
[8]

https://thinkingmachines.ai/blog/ on-policy-distillation/

doi: 10.64434/tml.20251026. https://thinkingmachines.ai/blog/on- policy-distillation. Mike A Merrill, Alexander G Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E Kelly Buchanan, et al. Terminal-bench: Benchmarking agents on hard, realistic tasks in command line interfaces.arXiv preprint arXiv:2601.11868,

work page doi:10.64434/tml.20251026
[9]

GPT-4 Technical Report

OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Flo- rencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774,

work page internal anchor Pith review arXiv
[10]

Trinity-rft: A general-purpose and unified framework for reinforcement fine-tuning of large language models, 2025

Xuchen Pan, Yanxi Chen, Yushuo Chen, Yuchang Sun, Daoyuan Chen, Wenhao Zhang, Yuexiang Xie, Yilun Huang, Yilei Zhang, Dawei Gao, Weijie Shi, Yaliang Li, Bolin Ding, and Jingren Zhou. Trinity-rft: A general-purpose and unified framework for reinforcement fine-tuning of large language models.arXiv preprint arXiv:2505.17826,

work page arXiv
[11]

arXiv preprint arXiv:2602.04942 , year =

Emiliano Penaloza, Dheeraj Vattikonda, Nicolas Gontier, Alexandre Lacoste, Laurent Char- lin, and Massimo Caccia. Privileged information distillation for language models.arXiv preprint arXiv:2602.04942,

work page arXiv
[12]

Efficient Reinforcement Finetuning via Adaptive Curriculum Learning.ArXiv, abs/2504.05520, 2025

Taiwei Shi, Yiyang Wu, Linxin Song, Tianyi Zhou, and Jieyu Zhao. Efficient reinforcement finetuning via adaptive curriculum learning.arXiv preprint arXiv:2504.05520,

work page arXiv
[13]

r3 l: Reflect-then-retry reinforcement learning with language-guided exploration, pivotal credit, and positive amplification.arXiv preprint arXiv:2601.03715,

11 Weijie Shi, Yanxi Chen, Zexi Li, Xuchen Pan, Yuchang Sun, Jiajie Xu, Xiaofang Zhou, and Yaliang Li. r3 l: Reflect-then-retry reinforcement learning with language-guided exploration, pivotal credit, and positive amplification.arXiv preprint arXiv:2601.03715,

work page arXiv
[14]

ALFWorld: Aligning Text and Embodied Environments for Interactive Learning

Mohit Shridhar, Xingdi Yuan, Marc-Alexandre C ˆot´e, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. Alfworld: Aligning text and embodied environments for interac- tive learning.arXiv preprint arXiv:2010.03768,

work page internal anchor Pith review arXiv 2010
[15]

Think or not? selective reasoning via reinforcement learning for vision-language models.arXiv preprint arXiv:2505.16854, 2025

Jiaqi Wang, Kevin Qinghong Lin, James Cheng, and Mike Zheng Shou. Think or not? selective reasoning via reinforcement learning for vision-language models.arXiv preprint arXiv:2505.16854, 2025a. Ruiyi Wang and Prithviraj Ammanabrolu. A practitioner’s guide to multi-turn agentic reinforcement learning.arXiv preprint arXiv:2510.01132,

work page arXiv
[16]

Science- world: Is your agent smarter than a 5th grader? InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp

Ruoyao Wang, Peter Jansen, Marc-Alexandre Cˆot´e, and Prithviraj Ammanabrolu. Science- world: Is your agent smarter than a 5th grader? InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 11279–11298,

2022
[17]

OpenClaw-RL: Train Any Agent Simply by Talking

Yinjie Wang, Xuyang Chen, Xiaolong Jin, Mengdi Wang, and Ling Yang. Openclaw-rl: Train any agent simply by talking.arXiv preprint arXiv:2603.10165,

work page Pith review arXiv
[18]

DUMP: Automated distribution- level curriculum learning for RL-based LLM post-training.arXiv preprint arXiv:2504.09710, 2025

Zhenting Wang, Guofeng Cui, Yu-Jhe Li, Kun Wan, and Wentian Zhao. Dump: Auto- mated distribution-level curriculum learning for rl-based llm post-training.arXiv preprint arXiv:2504.09710, 2025b. An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Hu...

work page arXiv
[19]

On-Policy Context Distillation for Language Models

Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents.Advances in Neural Information Processing Systems, 35:20744–20757, 2022a. Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in lan...

work page internal anchor Pith review arXiv
[20]

Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models.arXiv preprint arXiv:2601.18734,

work page internal anchor Pith review arXiv
[21]

For a 3B student, training under both a strong 30B teacher and a 7B RL teacher leads to similar outcomes: the KL divergence decreases steadily and the success rate improves at comparable rates, indicating that increasing teacher strength beyond a certain point does not yield additional benefits. In contrast, when the student capacity matches the teacher m...

2000