Adversarial Dual On-Policy Distillation from Expressive Teacher

Bo An; Chubin Zhang; Ivor W. Tsang; Jingxuan Wu; Mingcong Lei; Xingrui Yu; Yang You; Zhenglin Wan

arxiv: 2605.27095 · v2 · pith:RJSLWITTnew · submitted 2026-05-26 · 💻 cs.LG

Zhenglin Wan, Jingxuan Wu, Xingrui Yu, Chubin Zhang, Mingcong Lei, Bo An, Ivor W. Tsang, Yang You

Pith reviewed 2026-06-29 18:52 UTC · model grok-4.3

classification 💻 cs.LG

keywords: on-policy distillation · flow matching · imitation learning · adversarial training · robot control · demonstration learning · policy optimization · embodied AI

The pith

FA-OPD lets a flow-matching teacher co-train with a student policy to supply reward and action signals on visited states for better demonstration-based robot control.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to fix the offline nature of behavioral cloning by introducing on-policy distillation from a teacher learned only from demonstrations. It trains a flow-matching model as the teacher alongside a student MLP, using adversarial dual signals: one for expert-likeness reward to encourage exploration and one for action targets to guide exploitation. This setup allows the policy to generalize to new states while staying close to expert behavior. The method is tested on six benchmarks in navigation, manipulation, and locomotion, where it outperforms baselines and handles noisy or limited data better.

Core claim

The central discovery is that coupling reward distillation for long-horizon optimization with action distillation for local guidance in an adversarial on-policy setup enables a demonstration-trained teacher to provide effective signals during student rollouts, leading to improved performance and robustness in embodied control tasks.

What carries the argument

The adversarial dual on-policy distillation, in which the flow-matching teacher provides both a reward channel over state-action pairs and an action channel at student-visited states.
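To make the two channels concrete, here is a minimal sketch of how a reward term and an action term might be combined on a batch of student-visited states. It is an illustration under stated assumptions, not the authors' implementation: the function name `dual_loss` and its arguments are hypothetical, a plain policy-gradient surrogate stands in for the PPO objective named in the Figure 1 caption, and `beta` plays the role of the β weight that the paper's hyperparameter study describes as trading off action distillation against reward distillation.

```python
from typing import List, Sequence

Vector = List[float]


def mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dual_loss(
    log_probs: Sequence[float],        # log pi(a|s) of the student's own actions
    advantages: Sequence[float],       # advantages computed from the teacher's reward
    student_actions: Sequence[Vector],
    teacher_actions: Sequence[Vector], # teacher targets at the same visited states
    beta: float,                       # hypothetical stand-in for the paper's beta
) -> float:
    # Reward channel: simple policy-gradient surrogate (assumption: used here
    # in place of PPO's clipped objective to keep the sketch short).
    reward_term = -mean([lp * adv for lp, adv in zip(log_probs, advantages)])
    # Action channel: mean squared distance to the teacher's target actions.
    action_term = mean(
        [squared_distance(a, t) for a, t in zip(student_actions, teacher_actions)]
    )
    return reward_term + beta * action_term


log_probs = [-1.0, -2.0]
advantages = [0.5, -0.5]
student_actions = [[0.0, 0.0], [1.0, 1.0]]
teacher_actions = [[0.0, 1.0], [1.0, 1.0]]
print(dual_loss(log_probs, advantages, student_actions, teacher_actions, beta=2.0))
```

With `beta=0.0` only the reward term remains, so the student is driven purely by the expert-likeness signal; larger values pull the student's actions toward the teacher's targets.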

If this is right

Reward distillation enables generalization beyond point-wise demonstrations.
Action distillation keeps exploration anchored near expert-like behavior.
The approach beats strong baselines on six robot benchmarks.
It exhibits stronger robustness under noisy or limited demonstrations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This dual signal mechanism might apply to other imitation learning settings where expert policies are not available at test time.
Combining flow-matching with on-policy methods could address distribution shift in a wider range of control problems.
The method's success suggests that teacher-student co-training can be more effective than fixed teacher distillation in demonstration-only scenarios.

Load-bearing premise

A flow-matching teacher trained solely on demonstrations can provide useful reward and action signals on states encountered by the student during its own rollouts.
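One simple way to probe this premise is to measure how far student-visited states fall from the demonstration states. The sketch below is a hypothetical diagnostic, not something the paper reports: the function names, the Euclidean nearest-neighbor distance, and the `threshold` value are all assumptions chosen for illustration.

```python
import math
from typing import List, Sequence

Vector = List[float]


def nearest_demo_distance(state: Vector, demo_states: Sequence[Vector]) -> float:
    # Euclidean distance from one visited state to its closest demonstration state.
    return min(math.dist(state, demo) for demo in demo_states)


def ood_fraction(
    visited_states: Sequence[Vector],
    demo_states: Sequence[Vector],
    threshold: float,  # assumed cutoff; in practice it would need calibration
) -> float:
    # Fraction of visited states whose nearest demonstration is farther than threshold.
    far = [
        s for s in visited_states
        if nearest_demo_distance(s, demo_states) > threshold
    ]
    return len(far) / len(visited_states)


demo_states = [[0.0, 0.0], [1.0, 0.0]]
visited_states = [[0.1, 0.0], [3.0, 4.0]]
print(ood_fraction(visited_states, demo_states, threshold=0.5))
```

A high fraction would indicate that the teacher is being queried well outside its training support, which is exactly the regime where the premise is least certain.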

What would settle it

Observing that FA-OPD fails to outperform the baselines or loses robustness on the robot navigation, manipulation, and locomotion benchmarks when demonstrations are noisy would falsify the claim.

Figures

Figures reproduced from arXiv: 2605.27095 by Bo An, Chubin Zhang, Ivor W. Tsang, Jingxuan Wu, Mingcong Lei, Xingrui Yu, Yang You, Zhenglin Wan.

**Figure 1.** FA-OPD uses the FM teacher only during training. It distills to the MLP student through two channels: reward distillation supplies expert-likeness scores for PPO, while action distillation supplies dense target actions at student-visited states. Only the MLP student is deployed.

**Figure 2.** Learning curves of FA-OPD and baselines across six environments.

**Figure 3.** Performance of all methods in the Fetch-pick environment across six noise levels.

**Figure 4.** Comparison of four algorithms for online updating of FM policies. Panel (a) shows the training curves of the four algorithms in Maze2d; panel (b) compares their computational overhead (time cost).

**Figure 5.** Overview of the six evaluation environments. Navigation: (a) Ant-goal tasks a quadruped agent with reaching a target position; (e) Maze2d requires an agent to navigate a 2D maze to a goal location. Locomotion: (d) Hopper requires fast and stable forward locomotion without falling; (f) Walker2d requires fast and stable forward locomotion without falling. Manipulation: (b) Hand-rotate requires dexterous in-ha…

**Figure 6.** Performance of FA-OPD with different β values in the Fetch-pick environment.

**Figure 7.** Learning curves of all methods in the Hand-rotate environment across six noise levels.

Original abstract

Learning from demonstrations in embodied control is often cast as behavioral cloning, and recent diffusion or flow-matching policies improve this paradigm by modeling multi-modal expert actions. Yet these methods remain offline supervised learners: the policy is trained only on expert states and receives no corrective signal on the states it actually visits. On-policy distillation (OPD) offers a natural remedy, but standard OPD assumes a strong fixed teacher, which is unavailable in demonstration-only control. We propose FA-OPD, an adversarial dual on-policy distillation method in which a Flow Matching (FM) teacher is learned from demonstrations and co-trained with a lightweight MLP student. The teacher provides two complementary signals on student rollouts. The reward channel learns an expert-likeness objective over state-action pairs and drives online exploration through long-horizon policy optimization. The action channel supplies dense local targets at student-visited states, stabilizing exploitation. FA-OPD couples them so that reward distillation enables generalization beyond point-wise demonstrations, while action distillation keeps exploration anchored near expert-like behavior. Across six robot navigation, manipulation, and locomotion benchmarks, FA-OPD beats strong baselines and shows much stronger robustness under noisy or limited demonstrations. Source code: https://github.com/vanzll/FA-OPD.
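The contrast between an FM teacher and a lightweight MLP student is easier to see with a sketch of how a flow-matching model produces an action: it integrates a velocity field from Gaussian noise at t = 0 to an action at t = 1, evaluating the field once per step, whereas an MLP needs a single forward pass. The code below is a toy illustration under assumptions: `euler_sample` and `toy_velocity` are hypothetical names, the closed-form field replaces a learned network, and the target "expert action" of twice the state is invented for the example.

```python
import random
from typing import Callable, List

Vector = List[float]


def euler_sample(
    velocity: Callable[[Vector, float, Vector], Vector],
    state: Vector,
    a0: Vector,
    num_steps: int,
) -> Vector:
    """Integrate da/dt = velocity(a, t, state) from t = 0 to t = 1 with Euler steps."""
    a = list(a0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity(a, t, state)  # one field evaluation per integration step
        a = [ai + dt * vi for ai, vi in zip(a, v)]
    return a


def toy_velocity(a: Vector, t: float, state: Vector) -> Vector:
    # Hypothetical closed-form field (not a learned network): it carries any
    # starting point to the invented target action 2 * state by t = 1.
    target = [2.0 * s for s in state]
    return [(g - ai) / (1.0 - t) for g, ai in zip(target, a)]


random.seed(0)
state = [0.5, -1.0]
a0 = [random.gauss(0.0, 1.0) for _ in state]  # a0 ~ N(0, I)
action = euler_sample(toy_velocity, state, a0, num_steps=10)
print(action)  # approximately [1.0, -2.0]
```

The loop makes the cost difference visible: `num_steps` field evaluations per action for the flow-matching sampler versus one pass for an MLP policy.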

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

FA-OPD adds on-policy dual signals from a learned flow-matching teacher to standard offline imitation, but the abstract gives no numbers or checks on whether the teacher stays reliable outside the demo states.

The core idea is straightforward: train a flow-matching model on demonstrations, then use it during student rollouts to supply both an adversarial reward (how expert-like the state-action pair looks) and dense action targets. This turns the usual offline behavioral cloning setup into something closer to on-policy distillation without needing a fixed expert. That coupling is the main new piece relative to prior OPD and diffusion-imitation work.
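As a rough illustration of what an adversarial expert-likeness reward can look like, the sketch below uses the standard AIRL-style log-ratio, r = log D − log(1 − D), where D is a discriminator's probability that a state-action pair came from the expert. This is a generic scalar-discriminator sketch under assumptions: the paper's appendix refers to a flow-matching-based Softmax discriminator whose definition is not shown on this page, and the names `airl_reward` and `eps` are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def airl_reward(d: float, eps: float = 1e-8) -> float:
    """Log-ratio reward log D - log(1 - D) for a discriminator output d in (0, 1)."""
    # Clamp to avoid log(0) when the discriminator saturates.
    d = min(max(d, eps), 1.0 - eps)
    return math.log(d) - math.log(1.0 - d)


# A pair the discriminator cannot tell apart from expert data gets zero reward.
print(airl_reward(0.5))
# If D = sigmoid(f), the reward recovers the logit f (up to rounding).
print(airl_reward(sigmoid(2.0)))
```

Because the reward equals the discriminator's logit, pairs judged more expert-like than not receive positive reward and the rest receive negative reward, which is what lets the signal drive policy optimization on student rollouts.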

It handles the offline limitation cleanly on paper by letting the reward channel push exploration while the action channel keeps things anchored. The robustness claim under noisy or limited demos follows from that dual objective.

The soft spot is exactly the one the stress-test flags. The teacher is trained only on demonstration states, yet it has to produce useful signals on states the student actually reaches during online rollouts. Nothing in the abstract shows diagnostics, OOD checks, or ablations that would confirm the flow-matching density does not drift or get hacked. The six-benchmark superiority is stated without metrics, baselines, or statistical detail, so the empirical side cannot be judged yet.

This is for groups already working on embodied imitation or on-policy distillation who want to try the dual-channel trick. If the full paper includes the missing ablations and the numbers hold, it is worth a serious referee. Otherwise the generalization assumption stays untested.

Referee Report

3 major / 0 minor

Summary. The paper proposes FA-OPD, an adversarial dual on-policy distillation method in which a flow-matching teacher is trained from demonstrations and co-trained with a lightweight MLP student policy. The teacher supplies a reward signal (expert-likeness over state-action pairs) for long-horizon policy optimization on student rollouts and dense action targets for stabilization. The approach is claimed to enable generalization beyond demonstration states while remaining anchored near expert behavior, yielding superior performance and robustness on six robot navigation, manipulation, and locomotion benchmarks compared to strong baselines, especially under noisy or limited demonstrations.

Significance. If the empirical claims hold and the teacher generalization assumption is validated, the work would offer a concrete mechanism for combining offline expressive generative models with on-policy corrective signals, addressing a recognized limitation of pure behavioral cloning and fixed-teacher OPD in demonstration-only settings. The dual-channel design (reward for exploration, action for exploitation) is a plausible way to mitigate distribution shift, and the reported robustness gains under limited/noisy data would be of practical interest in embodied control.

major comments (3)

[Abstract] Abstract: the central empirical claim (superiority on six benchmarks plus stronger robustness under noisy/limited demos) is stated without any metrics, baseline names, statistical tests, or ablation results, rendering the primary contribution unevaluable from the provided text.
[Abstract] Abstract and method description: the key assumption that the flow-matching teacher (trained only on demonstrations) supplies reliable, non-misleading reward and action signals on states visited by the student during on-policy rollouts is not supported by any OOD diagnostics, density estimation checks, or ablations; this assumption is load-bearing for both the generalization and robustness claims.
[Abstract] Method description: no equations, objective functions, or training algorithm are supplied for the adversarial coupling of the reward and action channels, preventing assessment of whether the dual objective avoids reward hacking, mode collapse, or circular dependence on the student distribution.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on clarity and validation. We address each major point below and have revised the manuscript to incorporate additional details, metrics, and analyses where appropriate.

Referee: [Abstract] Abstract: the central empirical claim (superiority on six benchmarks plus stronger robustness under noisy/limited demos) is stated without any metrics, baseline names, statistical tests, or ablation results, rendering the primary contribution unevaluable from the provided text.

Authors: We agree that the abstract would benefit from more concrete details. In the revised manuscript, we have updated the abstract to include specific performance metrics (e.g., success rates and robustness improvements on the six benchmarks), names of the primary baselines, and references to statistical tests and ablations demonstrating the robustness gains under noisy/limited demonstrations. revision: yes

Referee: [Abstract] Abstract and method description: the key assumption that the flow-matching teacher (trained only on demonstrations) supplies reliable, non-misleading reward and action signals on states visited by the student during on-policy rollouts is not supported by any OOD diagnostics, density estimation checks, or ablations; this assumption is load-bearing for both the generalization and robustness claims.

Authors: This is a fair critique of the load-bearing assumption. While the reported robustness results under noisy and limited data offer indirect empirical support, we did not provide explicit OOD diagnostics. We have added a new analysis subsection with density estimation on student-visited states and an ablation isolating the reward channel to directly validate the teacher's generalization behavior. revision: yes

Referee: [Abstract] Method description: no equations, objective functions, or training algorithm are supplied for the adversarial coupling of the reward and action channels, preventing assessment of whether the dual objective avoids reward hacking, mode collapse, or circular dependence on the student distribution.

Authors: We acknowledge the abstract's brevity omitted these details. The full manuscript already contains the reward and action objectives plus the co-training procedure; we have now added a concise overview of the dual objective and its adversarial coupling to the abstract, along with explicit discussion in the method section of how the design mitigates reward hacking and mode collapse via on-policy anchoring. revision: partial

Circularity Check

0 steps flagged

No circularity: new method combination with external benchmark validation

The paper introduces FA-OPD as an empirical combination of a flow-matching teacher (trained on demonstrations) with adversarial dual on-policy distillation signals. No equations, derivations, or 'predictions' are presented that reduce by construction to fitted inputs or self-citations. The central claims rest on reported performance across six robot benchmarks under noisy/limited demos, which are external to any internal fitting loop. The generalization assumption about the teacher on student states is an empirical hypothesis, not a definitional tautology. This matches the default case of a self-contained empirical proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based on the abstract alone, the method rests on standard assumptions of imitation learning and flow matching. It introduces no explicitly fitted free parameters or postulated entities; its one domain assumption is listed below.

axioms (1)

domain assumption: A flow-matching model trained on demonstrations can generate useful expert-likeness rewards and action targets on out-of-distribution states visited by a student policy.
This is the core premise enabling the dual-signal mechanism described in the abstract.

Reference graph

Works this paper leans on

35 extracted references · 24 canonical work pages · 12 internal anchors

[1]

R., Geist, M., and Bachem, O

Agarwal, R., Vieillard, N., Zhou, Y., Stanczyk, P., Garea, S. R., Geist, M., and Bachem, O. On-policy distillation of language models: Learning from self-generated mistakes. arXiv preprint arXiv:2306.13649,

work page arXiv
[2]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

URL https://arxiv.org/abs/2410.24164. Braun, M., Jaquier, N., Rozo, L., and Asfour, T. Riemannian flow matching policy for robot motion learning. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5144–5151,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Diffusion Policy: Visuomotor Policy Learning via Action Diffusion

URL https://arxiv.org/abs/2303.04137. DeepSeek-AI. DeepSeek-V4: Towards highly efficient million-token context intelligence,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Fu, J., Luo, K., and Levine, S

URL https://arxiv.org/abs/2502.06061. Fu, J., Luo, K., and Levine, S. Learning robust rewards with adversarial inverse reinforcement learning. arXiv preprint arXiv:1710.11248,

work page arXiv
[5]

Fujimoto, S., Meger, D., and Precup, D

URL https://openreview.net/forum?id=px0-N3_KjA. Fujimoto, S., Meger, D., and Precup, D. Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, pp. 2052–2062. PMLR,

2052
[6]

IDQL: Implicit Q-Learning as an Actor-Critic Method with Diffusion Policies

URL https://arxiv.org/abs/2304.10573. Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

HunyuanVideo: A Systematic Framework For Large Video Generative Models

URL https://arxiv.org/abs/2412.03603. Lai, C.-M., Wang, H.-C., Hsieh, P.-C., Wang, F., Chen, M.-H., and Sun, S.-H. Diffusion-reward adversarial imitation learning. Advances in Neural Information Processing Systems, 37:95456–95487,

work page internal anchor Pith review Pith/arXiv arXiv
[8]

Flow Matching for Generative Modeling

Lipman, Y., Chen, R. T., Ben-Hamu, H., Nickel, M., and Le, M. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747,

work page internal anchor Pith review Pith/arXiv arXiv
[9]

Flow-GRPO: Training Flow Matching Models via Online RL

Liu, J., Liu, G., Liang, J., Li, Y., Liu, J., Wang, X., Wan, P., Zhang, D., and Ouyang, W. Flow-GRPO: Training flow matching models via online RL. arXiv preprint arXiv:2505.05470,

work page internal anchor Pith review Pith/arXiv arXiv
[10]

M., Weber, E., Choi, H., Feng, H., and Kanazawa, A

McAllister, D., Ge, S., Yi, B., Kim, C. M., Weber, E., Choi, H., Feng, H., and Kanazawa, A. Flow matching policy gradients. arXiv preprint arXiv:2507.21053,

work page arXiv
[11]

Park, S., Li, Q., and Levine, S

URL https://arxiv.org/abs/2602.00743. Park, S., Li, Q., and Levine, S. Flow Q-learning. arXiv preprint arXiv:2502.02538,

work page arXiv
[12]

org/abs/2301.10677

URL https://arxiv.org/abs/2301.10677. Peng, X. B., Kanazawa, A., Toyer, S., Abbeel, P., and Levine, S. Variational discriminator bottleneck: Improving imitation learning, inverse RL, and GANs by constraining information flow. arXiv preprint arXiv:1810.00821,

work page arXiv
[13]

URL https://arxiv.org/abs/2312.11752. Ren, A. Z., Lidard, J., Ankile, L. L., Simeonov, A., Agrawal, P., Majumdar, A., Burchfiel, B., Dai, H., and Simchowitz, M. Diffusion policy policy optimization,

work page arXiv
[14]

Diffusion Policy Policy Optimization

URL https://arxiv.org/abs/2409.00588. Reuss, M., Li, M., Jia, X., and Lioutikov, R. Goal-conditioned imitation learning using score-based diffusion policies,

work page internal anchor Pith review Pith/arXiv arXiv
[15]

Ross, S., Gordon, G., and Bagnell, D

URL https://arxiv.org/abs/2304.02532. Ross, S., Gordon, G., and Bagnell, D. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 627–635. JMLR Workshop and Conference Proceedings,

work page arXiv
[16]

URL https://arxiv.org/abs/2310.07896. Szot, A. RL-toolkit. https://github.com/ASzot/rl-toolkit,

work page arXiv
[17]

Behavioral Cloning from Observation

Torabi, F., Warnell, G., and Stone, P. Behavioral cloning from observation. arXiv preprint arXiv:1805.01954,

work page internal anchor Pith review Pith/arXiv arXiv
[18]

Gymnasium: A Standard Interface for Reinforcement Learning Environments

Towers, M., Kwiatkowski, A., Terry, J., Balis, J. U., De Cola, G., Deleu, T., Goulão, M., Kallinteris, A., Krimmel, M., KG, A., et al. Gymnasium: A standard interface for reinforcement learning environments. arXiv preprint arXiv:2407.17032,

work page internal anchor Pith review Pith/arXiv arXiv
[19]

Wan: Open and Advanced Large-Scale Video Generative Models

Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.-W., Chen, D., Yu, F., Zhao, H., Yang, J., Zeng, J., Wang, J., Zhang, J., Zhou, J., Wang, J., Chen, J., Zhu, K., Zhao, K., Yan, K., Huang, L., Feng, M., Zhang, N., Li, P., Wu, P., Chu, R., Feng, R., Zhang, S., Sun, S., Fang, T., Wang, T., Gui, T., Weng, T., Shen, T., Lin, W., Wang, W., Wang, W., Zhou, W....

work page internal anchor Pith review Pith/arXiv arXiv
[20]

URL https://arxiv.org/abs/2410.06151v1. Wan, Z., Gao, A., Yu, X., Chao, P., Song, J., and Ran, M. POI recommendation via multi-objective adversarial imitation learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pp. 12676–12684, 2025b. doi: 10.1609/aaai.v39i12.33382. Wan, Z., Yu, X., Bossens, D. M., Lyu, Y., Guo, Q., F...

work page doi:10.1609/aaai.v39i12.33382
[21]

Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning

URL https://arxiv.org/abs/2208.06193. Xiao, H., Herman, M., Wagner, J., Ziesche, S., Etesami, J., and Linh, T. H. Wasserstein adversarial imitation learning. arXiv preprint arXiv:1906.08113,

work page internal anchor Pith review Pith/arXiv arXiv 1906
[22]

org/abs/2411.06965

URL https://arxiv.org/abs/2411.06965. Ze, Y., Zhang, G., Zhang, K., Hu, C., Wang, M., and Xu, H. 3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations,

work page arXiv
[23]

3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations

URL https://arxiv.org/abs/2403.03954. Zhang, C., Wan, Z., Chen, F., Yang, F., Feng, L., Zhou, Y., Yu, X., You, Y., Tsang, I., and An, B. Evolving diffusion and flow matching policies for online reinforcement learning,

work page internal anchor Pith review Pith/arXiv arXiv
[24]

Zhang, Q., Liu, Z., Fan, H., Liu, G., Zeng, B., and Liu, S

URL https://arxiv.org/abs/2409.01083. Zhang, Q., Liu, Z., Fan, H., Liu, G., Zeng, B., and Liu, S. Flowpolicy: Enabling fast and robust 3d flow-based policy via consistency flow matching for robot manipulation. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 14754–14762, 2025a. Zhang, T., Yu, C., Su, S., and Wang, Y. Reinflow: Fine...

work page arXiv
[25]

uni-modal tasks

URL https://arxiv.org/abs/2311.13443. A. Discussion This section provides the answers to possible questions about the FA-OPD framework and the experiment results in Q&A format below. A.1. Query about Experiment Results Q&A Q. Why the performance improvement of FA-OPD is marginal in Navigatio...

work page arXiv 2011
[26]

yields a principled reward signal for AIRL, rather than a heuristic distance, thus providing theoretical understanding of why a flow-based discriminator outperforms its diffusion-based counterpart given the same computational budget. The argument proceeds in three steps: (i) the per-sample FM loss is, up to a data-only constant, the negative of a variatio...

2023
[27]

Operationally, smaller Dist implies a tighter likelihood lower bound, so Dist is a calibrated proxy for the negative log-likelihood under the c-conditioned model

state the same equivalence directly in the flow-matching regime. Operationally, smaller Dist implies a tighter likelihood lower bound, so Dist is a calibrated proxy for the negative log-likelihood under the c-conditioned model. Proposition D.2 (AIRL reward as a log density ratio). Let DFM,θ be the Softmax discriminator of Eq. (10). The AIRL log-ratio reward ...

2017
[28]

and DRAIL (Lai et al., 2024), consistent with the empirical gap in Table

2024
[29]

Under the OPD lens of Sec. 2.3, Propositions D.1–D.2 together justify the FM teacher as a well-calibrated OPD scorer: its per-action score is tightly tied to expert log-likelihood, and the bound is tighter than that of any diffusion-based teacher under the same compute budget. E. Implementation Details The experiment environments are customized and adapted...

2024
[30]

All experiments are conducted on a Linux server equipped with four NVIDIA A40 (48GB) GPUs and an AMD EPYC 7543P 32-core CPU

and DRAIL (Lai et al., 2024). All experiments are conducted on a Linux server equipped with four NVIDIA A40 (48GB) GPUs and an AMD EPYC 7543P 32-core CPU. We show the algorithmic and experimental implementation details below. E.1. Algorithmic Details E.1.1. CHOICE OF CONDITIONAL PROBABILITY P...

2024
[31]

Table 2. Details of hyperparameters in FA-OPD

They cover the FM-enhanced discriminator, the FM vector field, distance-based reward, and training logistics. Table 2. Details of hyperparameters in FA-OPD. Name Value Meaning fm num steps 100 FM time discretization steps (used for t indexing in discriminator and for FM-based generation). d...

1993
[32]

larger β always better

F.2. Hyperparameter Study The hyperparameter β in Eq. 11 weights the action-distillation term against the reward-distillation term, and therefore controls the trade-off between the two distillation modes. As shown in Figure 6, we conducted an ablation study on β in the Fetch-pick environme...

2084
[33]

Similar conclusions as in Section 4.2 could be derived based on these results. F.4. Controlled comparison of policy heads under a shared learned reward To isolate the policy head from the reward signal, all methods here share thesamelearned reward from our FM-enhanced discriminator; the only varied factor is the policy architecture. We additionally includ...

2024
[34]

realness

2 .(41) At test time, $a_1$ is obtained by numerically integrating the ODE from $a_0 \sim \mathcal{N}(0, I)$. FP is a supervised learning approach to clone the expert behavior. GAIL (Ho & Ermon, 2016). GAIL frames imitation as matching occupancy measures via an adversarial game between policy $\pi_\phi$ and discriminator $D_\psi$: $\min_\psi \max_\phi \; \mathbb{E}_{(s,a)\sim\rho_{\pi_\phi}} \log D_\psi(s, a) + \mathbb{E}_{(s,a)\sim\rho_E} \log(1 - D_\psi($...

2016
[35]

Why the IRL setting is the practically interesting one. A common implicit assumption in much of the online-FM RL literature is that the environment reward is known

is a sibling framework conceptually adjacent to all of these but operates on language-model outputs with a pre-trained teacher, so it is omitted from the table for clarity; FA-OPD can be read as the natural extension of OPD to control with a learned, co-trained teacher. Why the IRL setting is the practically interesting one. A common implicit assumption in...

2011

[1] [1]

R., Geist, M., and Bachem, O

Agarwal, R., Vieillard, N., Zhou, Y ., Stanczyk, P., Garea, S. R., Geist, M., and Bachem, O. On-policy distillation of language models: Learning from self-generated mistakes. arXiv preprint arXiv:2306.13649,

work page arXiv

[2] [2]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

URL https: //arxiv.org/abs/2410.24164. Braun, M., Jaquier, N., Rozo, L., and Asfour, T. Riemannian flow matching policy for robot motion learning. In2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5144–5151,

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

Diffusion Policy: Visuomotor Policy Learning via Action Diffusion

URL https://arxiv.org/abs/2303.04137. DeepSeek-AI. DeepSeek-V4: Towards highly efficient million-token context intelligence,

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

Fu, J., Luo, K., and Levine, S

URL https: //arxiv.org/abs/2502.06061. Fu, J., Luo, K., and Levine, S. Learning robust rewards with adversarial inverse reinforcement learning.arXiv preprint arXiv:1710.11248,

work page arXiv

[5] [5]

Fujimoto, S., Meger, D., and Precup, D

URL https://openreview.net/ forum?id=px0-N3_KjA. Fujimoto, S., Meger, D., and Precup, D. Off-policy deep reinforcement learning without exploration. InInterna- tional Conference on Machine Learning, pp. 2052–2062. PMLR,

2052

[6] [6]

IDQL: Implicit Q-Learning as an Actor-Critic Method with Diffusion Policies

URL https://arxiv.org/abs/2304.10573. Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network.arXiv preprint arXiv:1503.02531,

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

HunyuanVideo: A Systematic Framework For Large Video Generative Models

URL https://arxiv.org/abs/2412.03603. Lai, C.-M., Wang, H.-C., Hsieh, P.-C., Wang, F., Chen, M.- H., and Sun, S.-H. Diffusion-reward adversarial imitation learning.Advances in Neural Information Processing Systems, 37:95456–95487,

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

Flow Matching for Generative Modeling

Lipman, Y ., Chen, R. T., Ben-Hamu, H., Nickel, M., and Le, M. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747,

work page internal anchor Pith review Pith/arXiv arXiv

[9] [9]

Flow-GRPO: Training Flow Matching Models via Online RL

Liu, J., Liu, G., Liang, J., Li, Y ., Liu, J., Wang, X., Wan, P., Zhang, D., and Ouyang, W. Flow-GRPO: Train- ing flow matching models via online rl.arXiv preprint arXiv:2505.05470,

work page internal anchor Pith review Pith/arXiv arXiv

[10] [10]

M., Weber, E., Choi, H., Feng, H., and Kanazawa, A

McAllister, D., Ge, S., Yi, B., Kim, C. M., Weber, E., Choi, H., Feng, H., and Kanazawa, A. Flow matching policy gradients.arXiv preprint arXiv:2507.21053,

work page arXiv

[11] [11]

Park, S., Li, Q., and Levine, S

URL https://arxiv.org/abs/ 2602.00743. Park, S., Li, Q., and Levine, S. Flow q-learning.arXiv preprint arXiv:2502.02538,

work page arXiv

[12] [12]

org/abs/2301.10677

URL https://arxiv. org/abs/2301.10677. Peng, X. B., Kanazawa, A., Toyer, S., Abbeel, P., and Levine, S. Variational discriminator bottleneck: Improv- ing imitation learning, inverse RL, and gans by constrain- ing information flow.arXiv preprint arXiv:1810.00821,

work page arXiv

[13] [13]

URL https://arxiv.org/abs/ 2312.11752. Ren, A. Z., Lidard, J., Ankile, L. L., Simeonov, A., Agrawal, P., Majumdar, A., Burchfiel, B., Dai, H., and Simchowitz, M. Diffusion policy policy optimization,

work page arXiv

[14] [14]

Diffusion Policy Policy Optimization

URL https://arxiv.org/abs/2409.00588. Reuss, M., Li, M., Jia, X., and Lioutikov, R. Goal- conditioned imitation learning using score-based diffu- sion policies,

work page internal anchor Pith review Pith/arXiv arXiv

[15] [15]

Ross, S., Gordon, G., and Bagnell, D

URL https://arxiv.org/ abs/2304.02532. Ross, S., Gordon, G., and Bagnell, D. A reduction of imita- tion learning and structured prediction to no-regret online learning. InProceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 627–635. JMLR Workshop and Conference Proceedings,

[16]

URL https://arxiv.org/abs/2310.07896. Szot, A. RL-toolkit. https://github.com/ASzot/rl-toolkit,

[17]

Behavioral Cloning from Observation

Torabi, F., Warnell, G., and Stone, P. Behavioral cloning from observation. arXiv preprint arXiv:1805.01954,

[18]

Gymnasium: A Standard Interface for Reinforcement Learning Environments

Towers, M., Kwiatkowski, A., Terry, J., Balis, J. U., De Cola, G., Deleu, T., Goulão, M., Kallinteris, A., Krimmel, M., KG, A., et al. Gymnasium: A standard interface for reinforcement learning environments. arXiv preprint arXiv:2407.17032,

[19]

Wan: Open and Advanced Large-Scale Video Generative Models

Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.-W., Chen, D., Yu, F., Zhao, H., Yang, J., Zeng, J., Wang, J., Zhang, J., Zhou, J., Wang, J., Chen, J., Zhu, K., Zhao, K., Yan, K., Huang, L., Feng, M., Zhang, N., Li, P., Wu, P., Chu, R., Feng, R., Zhang, S., Sun, S., Fang, T., Wang, T., Gui, T., Weng, T., Shen, T., Lin, W., Wang, W., Wang, W., Zhou, W....

[20]

URL https://arxiv.org/abs/2410.06151v1. Wan, Z., Gao, A., Yu, X., Chao, P., Song, J., and Ran, M. POI recommendation via multi-objective adversarial imitation learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pp. 12676–12684, 2025b. doi: 10.1609/aaai.v39i12.33382. Wan, Z., Yu, X., Bossens, D. M., Lyu, Y., Guo, Q., F...

work page doi:10.1609/aaai.v39i12.33382

[21]

Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning

URL https://arxiv.org/abs/2208.06193. Xiao, H., Herman, M., Wagner, J., Ziesche, S., Etesami, J., and Linh, T. H. Wasserstein adversarial imitation learning. arXiv preprint arXiv:1906.08113,

[22]

URL https://arxiv.org/abs/2411.06965. Ze, Y., Zhang, G., Zhang, K., Hu, C., Wang, M., and Xu, H. 3D diffusion policy: Generalizable visuomotor policy learning via simple 3D representations,

[23]

3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations

URL https://arxiv.org/abs/2403.03954. Zhang, C., Wan, Z., Chen, F., Yang, F., Feng, L., Zhou, Y., Yu, X., You, Y., Tsang, I., and An, B. Evolving diffusion and flow matching policies for online reinforcement learning,

[24]

URL https://arxiv.org/abs/2409.01083. Zhang, Q., Liu, Z., Fan, H., Liu, G., Zeng, B., and Liu, S. Flowpolicy: Enabling fast and robust 3D flow-based policy via consistency flow matching for robot manipulation. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 14754–14762, 2025a. Zhang, T., Yu, C., Su, S., and Wang, Y. Reinflow: Fine...

[25]

uni-modal tasks

URL https://arxiv.org/abs/2311.13443.

A. Discussion

This section answers possible questions about the FA-OPD framework and the experimental results in Q&A format below.

A.1. Query about Experiment Results Q&A

Q. Why is the performance improvement of FA-OPD marginal in Navigatio...

[26]

yields a principled reward signal for AIRL, rather than a heuristic distance, thus providing theoretical understanding of why a flow-based discriminator outperforms its diffusion-based counterpart given the same computational budget. The argument proceeds in three steps: (i) the per-sample FM loss is, up to a data-only constant, the negative of a variatio...

2023

[27]

state the same equivalence directly in the flow-matching regime. Operationally, smaller Dist implies a tighter likelihood lower bound, so Dist is a calibrated proxy for the negative log-likelihood under the c-conditioned model.

Proposition D.2 (AIRL reward as a log density ratio). Let D_{FM,θ} be the Softmax discriminator of Eq. (10). The AIRL log-ratio reward ...

2017

[28]

and DRAIL (Lai et al., 2024), consistent with the empirical gap in Table

2024

[29]

Under the OPD lens of Sec. 2.3, Propositions D.1–D.2 together justify the FM teacher as a well-calibrated OPD scorer: its per-action score is tightly tied to expert log-likelihood, and the bound is tighter than that of any diffusion-based teacher under the same compute budget.

E. Implementation Details

The experiment environments are customized and adapted...

2024

[30]

and DRAIL (Lai et al., 2024). All experiments are conducted on a Linux server equipped with four NVIDIA A40 (48GB) GPUs and an AMD EPYC 7543P 32-core CPU. We show the algorithmic and experimental implementation details below.

E.1. Algorithmic Details

E.1.1. CHOICE OF CONDITIONAL PROBABILITY P...

2024

[31]

They cover the FM-enhanced discriminator, the FM vector field, distance-based reward, and training logistics.

Table 2. Details of hyperparameters in FA-OPD.

Name | Value | Meaning
fm num steps | 100 | FM time discretization steps (used for t indexing in discriminator and for FM-based generation).
d...

1993

[32]

larger β always better

F.2. Hyperparameter Study

The hyperparameter β in Eq. 11 weights the action-distillation term against the reward-distillation term, and therefore controls the trade-off between the two distillation modes. As shown in Figure 6, we conducted an ablation study on β in the Fetch-pick environme...

2084

[33]

Similar conclusions to those in Section 4.2 can be drawn from these results.

F.4. Controlled comparison of policy heads under a shared learned reward

To isolate the policy head from the reward signal, all methods here share the same learned reward from our FM-enhanced discriminator; the only varied factor is the policy architecture. We additionally includ...

2024

[34]

realness

2 .(41) At test time, a_1 is obtained by numerically integrating the ODE from a_0 ∼ N(0, I). FP is a supervised learning approach that clones the expert behavior.

GAIL (Ho & Ermon, 2016). GAIL frames imitation as matching occupancy measures via an adversarial game between policy π_ϕ and discriminator D_ψ: min_ψ max_ϕ E_{(s,a)∼ρ_{π_ϕ}} log D_ψ(s, a) + E_{(s,a)∼ρ_E} log(1 − D_ψ(...

2016
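The test-time sampling step described for the FP baseline above (drawing a_0 from a standard Gaussian and integrating the ODE to obtain a_1) can be sketched with a fixed-step Euler solver. This is a minimal illustration only, not the authors' implementation: the function names `sample_action` and `toy_vector_field`, the Euler scheme, and the hand-written toy field are all assumptions made here; the default of 100 steps simply mirrors the "fm num steps" value listed in Table 2.

```python
import math
import random


def sample_action(vector_field, state, action_dim, num_steps=100, seed=None):
    """Euler-integrate da/dt = v(a, t, state) from t=0 to t=1.

    `vector_field` is a hypothetical callable standing in for a learned
    flow-matching network; it returns a list with one entry per action dim.
    """
    rng = random.Random(seed)
    # a_0 ~ N(0, I): one independent standard normal per action dimension.
    a = [rng.gauss(0.0, 1.0) for _ in range(action_dim)]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so the toy field is well defined
        v = vector_field(a, t, state)
        a = [a_i + dt * v_i for a_i, v_i in zip(a, v)]
    return a  # approximation of a_1


def toy_vector_field(a, t, state):
    """Assumed toy field (not learned): pulls a toward tanh(state) by t=1."""
    target = [math.tanh(s) for s in state[: len(a)]]
    return [(g - a_i) / (1.0 - t) for g, a_i in zip(target, a)]


state = [0.5, -1.0, 2.0]
action = sample_action(toy_vector_field, state, action_dim=2, seed=0)
```

With a trained network in place of `toy_vector_field`, fewer steps trade accuracy for speed; a higher-order solver could be swapped in without changing the interface.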

[35]

is a sibling framework conceptually adjacent to all of these but operates on language-model outputs with a pre-trained teacher, so it is omitted from the table for clarity; FA-OPD can be read as the natural extension of OPD to control with a learned, co-trained teacher.

Why the IRL setting is the practically interesting one. A common implicit assumption in...

2011