D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models

Dengyang Jiang; Dongyang Liu; Harry Yang; Mingzhe Zheng; Peng Gao; Qilong Wu; Ruoyi Du; Steven Hoi; Xiangpeng Yang; Xin Jin

arxiv: 2605.05204 · v3 · pith:OMZRZRQAnew · submitted 2026-05-06 · 💻 cs.CV

D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models

Dengyang Jiang , Xin Jin , Dongyang Liu , Zanyi Wang , Mingzhe Zheng , Ruoyi Du , Xiangpeng Yang , Qilong Wu

show 4 more authors

Zhen Li Peng Gao Harry Yang Steven Hoi

This is my paper

Pith reviewed 2026-05-20 23:14 UTC · model grok-4.3

classification 💻 cs.CV

keywords diffusion modelsself-distillationon-policy learningfew-step inferenceimage generationfine-tuningcontinuous adaptation

0 comments

The pith

Step-distilled diffusion models can be continuously fine-tuned on new concepts without losing their few-step speed by using on-policy self-distillation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tackles the problem that standard fine-tuning breaks the speed of few-step diffusion models. It introduces D-OPSD to turn supervised fine-tuning into an on-policy self-distillation process. The model generates its own trajectories and then compares a text-only student version against a teacher version that also receives the target image. Alignment happens between the two predicted distributions on those self-generated paths. A reader would care because this would let fast image generators keep adapting to new styles or subjects without retraining from scratch or regaining slow multi-step sampling.

Core claim

The paper claims that modern diffusion models with an LLM or VLM encoder inherit in-context capabilities that allow a teacher conditioned on both text prompt and target image to provide reliable supervision to a text-only student. Training then minimizes the difference between the two predicted distributions over the student's own roll-outs, so the model acquires new concepts and styles while its original few-step inference capacity stays intact.

What carries the argument

On-policy self-distillation, in which the model serves as both teacher (conditioned on text plus target image) and student (text only) and the loss aligns their output distributions on trajectories sampled from the student itself.

If this is right

The model acquires new concepts and styles through continuous supervised fine-tuning.
The original few-step inference performance remains unchanged after tuning.
Training draws on the model's inherited in-context capabilities from its encoder to generate the supervisory signal.
Practical ongoing adaptation of efficient image generators to specific domains becomes feasible.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same self-distillation pattern could be tested on other conditional generative models that use encoder-based prompts.
Deployed systems might use repeated rounds of this process for gradual personalization without full retraining.
Checking the method on base models without strong encoders would test how far the in-context assumption reaches.

Load-bearing premise

Modern diffusion models inherit enough in-context capabilities from their LLM or VLM encoders that a teacher conditioned on both text and target image can reliably supervise a text-only student.

What would settle it

Apply D-OPSD to a step-distilled model and then check whether high-quality images on new concepts still appear in the original few steps; a clear rise in the number of steps required or a drop in quality on either new or original prompts would show the claim is false.

Figures

Figures reproduced from arXiv: 2605.05204 by Dengyang Jiang, Dongyang Liu, Harry Yang, Mingzhe Zheng, Peng Gao, Qilong Wu, Ruoyi Du, Steven Hoi, Xiangpeng Yang, Xin Jin, Zanyi Wang, Zhen Li.

**Figure 1.** Figure 1: We empirically investigate the visual appearance of generated images when con view at source ↗

**Figure 2.** Figure 2: Method overview. For each training pair, we first pass the prompt alone and the prompt together with the target image through the encoder to obtain 𝑐𝑠 and 𝑐𝑡 , respectively. We then sample a few-step trajectory using the student branch conditioned on 𝑐𝑠 . After that, the teacher and student predict velocities on the same trajectory states, and the student is updated by Equation 7. After training, the teach… view at source ↗

**Figure 3.** Figure 3: Visual comparison between baseline methods and ours finetuned on Z-ImageTurbo under customized training settings. Vanilla SFT training sacrifices the original fewstep capacity, and PSO suffers from the overfitting to training set, whereas our method enables the step-distilled model to continuously learn new concepts while maintaining the few-step capacity. large drops in Quality-S and Aesthetic-S in view at source ↗

**Figure 4.** Figure 4: Visual comparison between baseline methods and ours finetuned on Z-ImageTurbo under full-finetuning settings. SFT and PSO training sacrifices the original few-step capacity, whereas our method enables the step-distilled model to continuously learn to bias target domain while maintaining the few-step capacity as well as the learned knowledge in the original domain. learned knowledge. We conduct training an… view at source ↗

**Figure 5.** Figure 5: Ablation on (a) the different training strategies, and (b) the different way to build teacher model. We report the curves across training steps of DINO feature similarity between the generated images and the targets, as well as the Quality Score of the generated images. Training conducted on Z-Image-Turbo with LoRA. Better to zoom in to check the difference. use the student copy leads to training collapse.… view at source ↗

**Figure 6.** Figure 6: When the teacher model fails to generate images consistent with the concept ID under multimodal condition and therefore cannot provide an effective supervision signal, training will fail. Requirements for teacher capability. The success of D-OPSD is contingent upon the base model’s in-context abilities. In specific, as shown in view at source ↗

**Figure 7.** Figure 7: We also empirically investigate the difference of generated images when con view at source ↗

**Figure 7.** Figure 7: When the teacher model fails to generate images consistent with the concept ID under multimodal condition and therefore cannot provide an effective supervision signal, training will fail. Requirements for teacher capability. The success of D-OPSD is contingent upon the base model’s in-context abilities. In specific, as shown in [PITH_FULL_IMAGE:figures/full_fig_p010_7.png] view at source ↗

**Figure 8.** Figure 8: Comparison of generated images of ZImage-Turbo conditioned on multimodal feature using Qwen3-VL 4B and Qwen3-VL 4B with LLM part reweighted by Qwen3-4B LM. To address this issue, we replace the weights of the LLM component in Qwen3-VL-4B with those from the more compatible Qwen3-4B, while keeping the ViT and Connector weights unchanged. In this way, we preserve multimodal in-context capability while… view at source ↗

**Figure 8.** Figure 8: We also empirically investigate the difference of generated images when con [PITH_FULL_IMAGE:figures/full_fig_p012_8.png] view at source ↗

**Figure 9.** Figure 9: Comparison of generated images of ZImage-Turbo conditioned on multimodal feature using Qwen3-VL 4B and Qwen3-VL 4B with LLM part reweighted by Qwen3-4B LM. To address this issue, we replace the weights of the LLM component in Qwen3-VL-4B with those from the more compatible Qwen3-4B, while keeping the ViT and Connector weights unchanged. In this way, we preserve multimodal in-context capability while… view at source ↗

read the original abstract

The landscape of high-performance image generation models is currently shifting from the inefficient multi-step ones to the efficient few-step counterparts (e.g, Z-Image-Turbo and FLUX.2-klein). However, these models present significant challenges for direct continuous supervised fine-tuning. For example, applying the commonly used fine-tuning technique would compromise their inherent few-step inference capability. To address this, we propose D-OPSD, a novel training paradigm for step-distilled diffusion models that enables on-policy learning during supervised fine-tuning. We first find that the modern diffusion models, where the LLM/VLM serves as the encoder, can inherit its encoder's in-context capabilities. This enables us to formulate the training as an on-policy self-distillation process. Specifically, during training, we make the model act as both the teacher and the student with different contexts, where the student is conditioned only on the text feature, while the teacher is conditioned on the multimodal feature of both the text prompt and the target image. Training minimizes the two predicted distributions over the student's own roll-outs. By optimizing on the model's own trajectory and under its own supervision, D-OPSD enables the model to learn new concepts, styles, etc., without sacrificing the original few-step capacity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

D-OPSD gives a workable on-policy self-distillation route for adapting few-step diffusion models to new concepts without losing speed, but the approach stands or falls on whether the multimodal teacher signal stays reliable during updates.

read the letter

The main point for you is that this paper shows how to keep tuning step-distilled generators like FLUX.2-klein on new data by turning the model into its own teacher and student. The student sees only the text prompt while the teacher gets the prompt plus the target image, and training aligns their outputs on the student's own samples. That setup is the concrete novelty here, and it directly tackles the common failure mode where ordinary fine-tuning destroys the few-step sampling behavior. The authors are right to flag that modern diffusion models with LLM or VLM encoders can carry over some in-context flexibility, and using that to create an on-policy loop is a reasonable engineering move. If the experiments later show stable adaptation with preserved inference speed, the method could see use in deployment pipelines that need ongoing customization. The soft spot is the load-bearing assumption that the extra image context in the teacher produces a signal that remains both accurate and compatible with the student's trajectory once parameters start moving. The abstract states this inheritance as a finding but gives no derivation or early stability check, so the risk of gradual drift in the few-step regime is still open. I would want to see ablations that isolate whether the dual-context teacher actually outperforms simpler distillation baselines and whether the original sampling quality holds after several adaptation rounds. This work is aimed at people already shipping or iterating on efficient image generators rather than the broader diffusion community. A practitioner who needs to add styles or concepts to a fixed-step model without retraining from scratch would find the formulation useful to try. The paper deserves a serious referee because the problem is real and the proposed fix is specific enough to evaluate on its own terms.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes D-OPSD, an on-policy self-distillation training paradigm for step-distilled diffusion models. The core idea is that modern diffusion models with LLM/VLM encoders inherit in-context capabilities, allowing the same model to act as both teacher (conditioned on multimodal input consisting of the text prompt plus target image) and student (conditioned on text features only). Training minimizes divergence between the two predicted distributions evaluated on the student's own roll-outs, with the goal of enabling supervised fine-tuning for new concepts and styles while preserving the original few-step sampling behavior.

Significance. If the central claim is substantiated, the result would be significant for the ongoing shift toward efficient few-step diffusion models. It offers a potential solution to the problem of continuous supervised fine-tuning without degrading inference speed, which is a practical barrier for models such as Z-Image-Turbo and FLUX.2-klein. The on-policy self-distillation framing could also inform related work on self-supervised adaptation in generative models more broadly.

major comments (2)

[Abstract and §3] Abstract and §3 (Method description): The central assumption that the LLM/VLM encoder supplies sufficient in-context capabilities for the multimodal teacher to produce a reliable, distribution-compatible supervisory signal for the text-only student is stated as a finding but is not accompanied by any derivation, preliminary ablation, or stability argument showing why this conditioning remains valid once parameters are updated on-policy. This assumption is load-bearing for the claim that few-step capacity is preserved.
[§4] §4 (Experiments): No quantitative results, ablations, or comparisons are referenced that isolate the contribution of the on-policy self-distillation objective versus standard supervised fine-tuning; without such evidence it is impossible to assess whether the method actually avoids the distribution shift or capacity erosion described in the abstract.

minor comments (1)

[Abstract] Abstract: The models Z-Image-Turbo and FLUX.2-klein are mentioned without citations or references.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment below and describe the revisions we will make to the manuscript.

read point-by-point responses

Referee: [Abstract and §3] Abstract and §3 (Method description): The central assumption that the LLM/VLM encoder supplies sufficient in-context capabilities for the multimodal teacher to produce a reliable, distribution-compatible supervisory signal for the text-only student is stated as a finding but is not accompanied by any derivation, preliminary ablation, or stability argument showing why this conditioning remains valid once parameters are updated on-policy. This assumption is load-bearing for the claim that few-step capacity is preserved.

Authors: We agree that the validity of the in-context capability after on-policy updates is a central assumption. The manuscript presents this as an empirical observation that enables the teacher-student formulation, with the on-policy rollouts intended to keep the distributions aligned. While no formal derivation is given, the design minimizes shift by supervising on the student's own trajectories. To strengthen the presentation, we will add a new preliminary ablation in the revised §3 that measures the divergence between teacher and student predictions on held-out student trajectories both before and after a short training run, providing evidence for stability of the supervisory signal. revision: yes
Referee: [§4] §4 (Experiments): No quantitative results, ablations, or comparisons are referenced that isolate the contribution of the on-policy self-distillation objective versus standard supervised fine-tuning; without such evidence it is impossible to assess whether the method actually avoids the distribution shift or capacity erosion described in the abstract.

Authors: We acknowledge that the current experiments do not include an explicit head-to-head comparison against standard supervised fine-tuning. The reported results focus on demonstrating that D-OPSD enables acquisition of new concepts while retaining few-step sampling speed. To isolate the contribution of the on-policy objective, we will add quantitative comparisons in the revised §4, including a standard SFT baseline with metrics on both concept fidelity and inference-step preservation, allowing direct assessment of distribution-shift mitigation. revision: yes

Circularity Check

0 steps flagged

No circularity: proposed on-policy objective is independent of its claimed outcomes

full rationale

The paper introduces D-OPSD by first stating an empirical observation that diffusion models with LLM/VLM encoders inherit in-context capabilities, then defines a training process in which the same model acts as teacher (multimodal conditioning on text + target image) and student (text-only) while minimizing divergence on the student's own roll-outs. This formulation is presented as a novel paradigm to enable supervised fine-tuning without eroding few-step sampling; the benefit of preserving original capacity is an intended empirical result of the objective rather than a quantity that reduces to the inputs by definition or by self-citation. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided description, and the derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that the encoder's in-context learning transfers to the diffusion model in a way that makes multimodal conditioning a valid teacher signal; no free parameters or invented entities are mentioned in the abstract.

axioms (1)

domain assumption Modern diffusion models where the LLM/VLM serves as the encoder can inherit its encoder's in-context capabilities.
This finding is invoked to justify formulating training as on-policy self-distillation with different contexts.

pith-pipeline@v0.9.0 · 5796 in / 1283 out tokens · 33631 ms · 2026-05-20T23:14:39.707898+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Training minimizes the two predicted distributions over the student's own roll-outs... student is conditioned only on the text feature, while the teacher is conditioned on the multimodal feature

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

A Brief Overview: On-Policy Self-Distillation In Large Language Models
cs.HC 2026-05 unverdicted novelty 2.0

OPSD lets a single LLM distill its own reasoning by sampling trajectories from the student role while granting the teacher role privileged access to verified solutions, reducing memory needs versus separate-model dist...
A Brief Overview: On-Policy Self-Distillation In Large Language Models
cs.HC 2026-05 unverdicted novelty 2.0

This overview paper explains the conceptual foundations and design principles of On-Policy Self-Distillation for large language models from a beginner's perspective.