Moment Matching Q-Learning

Sifei Liu; Weitong Zhang; Yiyan (Edgar) Liang

arxiv: 2605.29033 · v1 · pith:VDEEX5HZnew · submitted 2026-05-27 · 💻 cs.LG

Moment Matching Q-Learning

Yiyan (Edgar) Liang , Sifei Liu , Weitong Zhang This is my paper

Pith reviewed 2026-06-29 13:56 UTC · model grok-4.3

classification 💻 cs.LG

keywords moment matchingmaximum mean discrepancyQ-learningscore-based generative modelsflow-based policiesoffline reinforcement learningdistributional convergence

0 comments

The pith

Moment Matching Q-Learning applies maximum mean discrepancy to match all moment statistics in the conditional score function for distribution-level convergence in reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Moment Matching Q-Learning (MoMa QL) to address the slow inference of score-based and flow-based models in reinforcement learning. It uses maximum mean discrepancy from statistical testing to enforce matching across all orders of statistics between source and target distributions. This regularization is claimed to deliver distribution-level convergence for the conditional score function while keeping the algorithm stable across hyperparameter choices. Experiments on D4RL benchmarks show comparable or better performance with reduced computation, and the method excels in offline-to-online fine-tuning due to quicker action sampling.

Core claim

By enforcing strong regularization on all moment statistics via maximum mean discrepancy, MoMa QL guarantees distribution-level convergence for the conditional score function and remains stable under various hyperparameters, yielding more computationally efficient action sampling than prior score- or flow-based policies while maintaining competitive task performance.

What carries the argument

Maximum mean discrepancy (MMD) applied as strong regularization to match all orders of statistics between the original and target distributions of the conditional score function.

If this is right

Action sampling from flow-based policies becomes faster, removing the iterative sampling bottleneck in RL.
The learned policy adapts more rapidly during online fine-tuning after offline pre-training.
Performance on standard D4RL benchmarks stays comparable to or exceeds existing score- and flow-based methods.
The same moment-matching regularizer can be applied to other conditional generative models that suffer from slow sampling.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach may generalize to any RL algorithm whose policy is represented by a conditional generative model, not only Q-learning.
If the moment-matching step can be made differentiable and cheap, it could replace expensive likelihood-based training in high-dimensional continuous control.
Faster sampling opens the door to using these policies in real-time control loops where previous flow models were too slow.

Load-bearing premise

That matching all moment statistics with MMD on the conditional score function is enough to produce distribution-level convergence and hyperparameter stability inside the Q-learning update.

What would settle it

A controlled run on a D4RL task where MoMa QL either fails to reach the claimed distributional convergence (measured by MMD or Wasserstein distance between learned and target score distributions) or shows large performance variance when hyperparameters are swept over the same range used in the paper.

Figures

Figures reproduced from arXiv: 2605.29033 by Sifei Liu, Weitong Zhang, Yiyan (Edgar) Liang.

**Figure 1.** Figure 1: Learning curves of MoMa QL during online fine-tuning on halfcheetah-medium-replay and halfcheetah-medium. The offline initialization allows for rapid adaptation and consistent improvement. is easy to deploy without complex hyperparameter search. 5.6. Computational Cost Analysis We evaluate the computational efficiency of MoMa QL by measuring the training time per 1,000 gradient steps across different task… view at source ↗

**Figure 2.** Figure 2: Robustness analysis for η and N. (a) Performance is consistent across η ∈ [0.5, 5.0]. (b) High performance is maintained for N ≥ 2. (a) Kernel a (b) Kernel b (c) Noise pmean (d) Noise pstd [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗

**Figure 3.** Figure 3: Robustness analysis. Performance is stable across variations in kernel parameters (a, b) and noise parameters (pmean, pstd). ing cost (≈ 21 − 23 seconds per 1k steps) regardless of task dimensionality, whereas baselines exhibit significant slowdowns on high-dimensional tasks. Impact of Sampling Steps. We further analyze how the number of sampling steps affects training efficiency. Figure 5 shows the rela… view at source ↗

**Figure 6.** Figure 6: Ablation study on parameter a. (a) Walker2D environments. (b) HalfCheetah environments [PITH_FULL_IMAGE:figures/full_fig_p020_6.png] view at source ↗

**Figure 7.** Figure 7: Ablation study on parameters a (Walker2D) and b (HalfCheetah). C.5. Detailed Computational Cost Results C.5.1. BASELINE COMPARISON [PITH_FULL_IMAGE:figures/full_fig_p020_7.png] view at source ↗

**Figure 8.** Figure 8: Ablation study on parameter b. (a) HalfCheetah environments. (b) Hopper environments [PITH_FULL_IMAGE:figures/full_fig_p021_8.png] view at source ↗

**Figure 9.** Figure 9: Ablation study on Q-learning weight η. (a) Walker2D environments. (b) HalfCheetah environments [PITH_FULL_IMAGE:figures/full_fig_p021_9.png] view at source ↗

**Figure 10.** Figure 10: Ablation study on η (Walker2D) and steps (HalfCheetah). (a) Hopper environments. (b) Walker2D environments [PITH_FULL_IMAGE:figures/full_fig_p021_10.png] view at source ↗

**Figure 11.** Figure 11: Ablation study on num steps. (a) HalfCheetah environments. (b) Hopper environments [PITH_FULL_IMAGE:figures/full_fig_p021_11.png] view at source ↗

**Figure 12.** Figure 12: Ablation study on Pmean. (a) Walker2D environments. (b) HalfCheetah environments [PITH_FULL_IMAGE:figures/full_fig_p021_12.png] view at source ↗

**Figure 13.** Figure 13: Ablation study on Pmean (Walker2D) and Pstd (HalfCheetah). (a) Hopper environments. (b) Walker2D environments [PITH_FULL_IMAGE:figures/full_fig_p021_13.png] view at source ↗

**Figure 14.** Figure 14: Ablation study on Pstd. 21 [PITH_FULL_IMAGE:figures/full_fig_p021_14.png] view at source ↗

read the original abstract

Score-based and flow-based generative models exhibit remarkable expressive capacity in capturing complex distributions, and have been extensively deployed in tasks ranging from image generation to reinforcement learning. Nevertheless, these models suffer from prolonged inference latency, which imposes a significant computational bottleneck in RL with iterative sampling. To overcome this limitation, we propose a new framework named Moment Matching Q-Learning (MoMa QL), which utilizes a technique from statistical hypothesis testing known as maximum mean discrepancy (MMD) that intend to match all orders of statistics between the original and target distribution. By enforcing strong regularization on all moment statistics, this algorithm guarantees distribution-level convergence for conditional score function and remains stable under various hyperparameters. Empirically, we show that our method MoMa QL is more computationally efficient with a comparable if not competitive performance in various D4RL tasks. Remarkably, by accelerating the action sampling process for flow-based policies, MoMa QL demonstrates superior performance in offline-to-online RL tasks because of faster and stronger adaptability for online interactive finetuning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MoMa QL applies MMD moment matching inside Q-learning for flow policies to cut sampling time, but the distribution-level convergence claim for the conditional score has no visible derivation.

read the letter

The paper introduces MoMa QL, which adds MMD-based regularization to match all moments between the policy's output distribution and the target in a Q-learning setup for flow-based generative policies. The goal is faster action sampling without losing much performance.

It does a reasonable job highlighting the latency problem in iterative sampling for these models and shows empirical results on D4RL tasks where the method runs quicker while staying competitive. The offline-to-online finetuning experiments suggest the speed helps with faster adaptation, which is a practical point.

The main soft spot is the guarantee. The abstract states that strong MMD regularization on all moment statistics guarantees distribution-level convergence for the conditional score function and hyperparameter stability. Standard MMD works on unconditional distributions via a characteristic kernel; turning that into a conditional guarantee inside the Bellman update requires a specific construction, kernel, or proof step that is not shown. Without that step, the claim does not follow directly from the MMD property. Experiments are summarized at a high level with no mention of error bars, data splits, or controls, so the competitiveness is hard to weigh.

This is for RL researchers who already use flow or score-based policies and want a faster sampler. A reader looking for practical speed-ups on standard benchmarks could find the empirical section useful if the full paper supplies the missing derivation and experimental details. The work is coherent enough on its own terms to deserve a serious referee, though the theory section will need scrutiny.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes Moment Matching Q-Learning (MoMa QL), a framework that applies maximum mean discrepancy (MMD) regularization to match all orders of statistics between original and target distributions in score-based and flow-based generative models for RL. It claims this enforces strong regularization that guarantees distribution-level convergence for the conditional score function, yields hyperparameter stability, and produces more computationally efficient policies with competitive performance on D4RL benchmarks and superior results in offline-to-online RL due to accelerated action sampling.

Significance. If the claimed guarantee can be rigorously established and the empirical gains hold under standard controls, the work would offer a practical route to reducing inference latency for expressive generative policies in RL without sacrificing stability, addressing a known bottleneck in deploying flow- and score-based methods to sequential decision tasks.

major comments (2)

[Abstract] Abstract: The central claim that 'enforcing strong regularization on all moment statistics... guarantees distribution-level convergence for conditional score function' is presented without any theorem statement, proof outline, or derivation. Standard MMD matches moments between unconditional distributions via a characteristic kernel; no construction is supplied showing how this extends to the conditional p(·|s) under the Bellman operator or how the resulting gradient affects the score ∇log p(a|s) in the Q-learning update.
[Abstract] Abstract (empirical section): The competitiveness and superiority claims rest on D4RL results, yet the abstract supplies neither error bars, number of random seeds, data-split protocol, nor description of baseline implementations and hyperparameter controls, rendering the 'comparable if not competitive' and 'superior... in offline-to-online' assertions impossible to evaluate.

minor comments (1)

[Abstract] Abstract: The phrase 'match all orders of statistics' is used without specifying the kernel family or truncation order; a brief clarification of the MMD implementation would aid readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below, indicating where revisions will be made to strengthen the presentation.

read point-by-point responses

Referee: [Abstract] Abstract: The central claim that 'enforcing strong regularization on all moment statistics... guarantees distribution-level convergence for conditional score function' is presented without any theorem statement, proof outline, or derivation. Standard MMD matches moments between unconditional distributions via a characteristic kernel; no construction is supplied showing how this extends to the conditional p(·|s) under the Bellman operator or how the resulting gradient affects the score ∇log p(a|s) in the Q-learning update.

Authors: The abstract is a high-level summary; the full manuscript (Section 3) extends MMD to conditional distributions p(a|s) by applying the kernel to state-conditioned samples drawn under the Bellman operator, derives the resulting gradient on the score function (Eq. 8), and states the distribution-level convergence as Theorem 1 with a proof sketch in Appendix A. We agree the abstract should not stand alone on this point and will revise it to reference 'as shown in Section 3 and Theorem 1' while adding a one-sentence outline of the conditional construction. revision: yes
Referee: [Abstract] Abstract (empirical section): The competitiveness and superiority claims rest on D4RL results, yet the abstract supplies neither error bars, number of random seeds, data-split protocol, nor description of baseline implementations and hyperparameter controls, rendering the 'comparable if not competitive' and 'superior... in offline-to-online' assertions impossible to evaluate.

Authors: We agree the abstract is too terse on experimental protocol. The main text (Section 5) reports means and standard deviations over 5 random seeds, uses the canonical D4RL train/test splits, re-implements baselines from their original code with the hyperparameter grids stated in the paper, and evaluates offline-to-online finetuning with the standard interaction budget. We will revise the abstract to read 'with mean ± std over 5 seeds using standard D4RL protocols' to make the claims evaluable from the abstract alone. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The abstract presents MoMa QL as applying MMD moment matching to enforce regularization on all moment statistics, claiming this guarantees distribution-level convergence for the conditional score function in Q-learning. No equations, self-citations, fitted parameters renamed as predictions, or derivation steps are visible in the provided text. The central claim is advanced as an independent algorithmic contribution without any reduction to its own inputs by construction. The derivation is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review performed on abstract only; full paper would be needed to enumerate fitted parameters, background axioms, or new entities.

axioms (1)

domain assumption MMD matching of all moments produces distribution-level convergence of the conditional score function
Central technical claim stated in the abstract without derivation details.

pith-pipeline@v0.9.1-grok · 5699 in / 1154 out tokens · 34445 ms · 2026-06-29T13:56:27.733645+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

13 extracted references · 7 canonical work pages · 5 internal anchors

[1]

Score regu- larized policy optimization through diffusion behavior

Chen, H., Lu, C., Wang, Z., Su, H., and Zhu, J. Score regu- larized policy optimization through diffusion behavior. In International Conference on Learning Representations, volume 2024, pp. 10211–10230,

2024
[2]

and Jin, C

Ding, Z. and Jin, C. Consistency models as a rich and effi- cient policy class for reinforcement learning. InInterna- tional Conference on Learning Representations, volume 2024, pp. 53047–53066,

2024
[3]

Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems

Levine, S., Kumar, A., Tucker, G., and Fu, J. Offline rein- forcement learning: Tutorial, review, and perspectives on open problems.arXiv preprint arXiv:2005.01643,

work page internal anchor Pith review Pith/arXiv arXiv 2005
[4]

T., Ben-Hamu, H., Nickel, M., and Le, M

Lipman, Y ., Chen, R. T., Ben-Hamu, H., Nickel, M., and Le, M. Flow matching for generative modeling. In11th International Conference on Learning Representations, ICLR 2023,

2023
[5]

Playing Atari with Deep Reinforcement Learning

Mnih, V ., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing atari with deep reinforcement learning.arXiv preprint arXiv:1312.5602,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning

Peng, X. B., Kumar, A., Zhang, G., and Levine, S. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning.arXiv preprint arXiv:1910.00177,

work page internal anchor Pith review Pith/arXiv arXiv 1910
[7]

Probabilis- tic mixture-of-experts for efficient deep reinforcement learning.arXiv preprint arXiv:2104.09122,

Ren, J., Li, Y ., Ding, Z., Pan, W., and Dong, H. Probabilis- tic mixture-of-experts for efficient deep reinforcement learning.arXiv preprint arXiv:2104.09122,

work page arXiv
[8]

Dual rl: Unification and new methods for reinforcement and imi- tation learning

Sikchi, H., Zheng, Q., Zhang, A., and Niekum, S. Dual rl: Unification and new methods for reinforcement and imi- tation learning. InInternational Conference on Learning Representations, volume 2024, pp. 9305–9352,

2024
[9]

Score-Based Generative Modeling through Stochastic Differential Equations

Song, Y ., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Er- mon, S., and Poole, B. Score-based generative modeling through stochastic differential equations.arXiv preprint arXiv:2011.13456,

work page internal anchor Pith review Pith/arXiv arXiv 2011
[10]

Behavior Regularized Offline Reinforcement Learning

Wu, Y ., Tucker, G., and Nachum, O. Behavior regu- larized offline reinforcement learning.arXiv preprint arXiv:1911.11361,

work page internal anchor Pith review Pith/arXiv arXiv 1911
[11]

uni-modal tasks

Zhang, Q., Liu, Z., Fan, H., Liu, G., Zeng, B., and Liu, S. Flowpolicy: Enabling fast and robust 3d flow-based policy via consistency flow matching for robot manipula- tion. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pp. 14754–14762, 2025a. Zhang, S., Zhang, W., and Gu, Q. Energy-weighted flow matching for offline reinforc...

work page arXiv
[12]

Theorems and Derivations A.1

11 Moment Matching Q-Learning A. Theorems and Derivations A.1. Maximum Mean Discrepancy To facilitate a further discussion of MMD, it is necessary to first introduce several key mathematical definitions. Definition A.1(Integral Probability Metric).A metric D(·,·) between two probability measures µ, ν is called anintegral probability metric(IPM) if it sati...

2017
[13]

Given MMD loss function we defined in 3, we have: LD(θ) =E s,t Ext[k(f θ s,t(xt),·)]−E xr[k(f θ− s,r (xr),·)] 2 H =E s,t Ext,xr[k(f θ s,t(xt),·)−k(f θ− s,r (xr),·)] 2 H (reuse the samplex t) =E s,t hD Ext,xr[k(f θ s,t(xt),·)−k(f θ− s,r (xr),·)],E x′ t,x′r[k(f θ s,t(x′ t),·)−k(f θ− s,r (x′ r),·)] E H i =E s,t h Ext,xr,x′ t,x′r h k(f θ s,t(xt),·), k(f θ s,t...

2012

[1] [1]

Score regu- larized policy optimization through diffusion behavior

Chen, H., Lu, C., Wang, Z., Su, H., and Zhu, J. Score regu- larized policy optimization through diffusion behavior. In International Conference on Learning Representations, volume 2024, pp. 10211–10230,

2024

[2] [2]

and Jin, C

Ding, Z. and Jin, C. Consistency models as a rich and effi- cient policy class for reinforcement learning. InInterna- tional Conference on Learning Representations, volume 2024, pp. 53047–53066,

2024

[3] [3]

Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems

Levine, S., Kumar, A., Tucker, G., and Fu, J. Offline rein- forcement learning: Tutorial, review, and perspectives on open problems.arXiv preprint arXiv:2005.01643,

work page internal anchor Pith review Pith/arXiv arXiv 2005

[4] [4]

T., Ben-Hamu, H., Nickel, M., and Le, M

Lipman, Y ., Chen, R. T., Ben-Hamu, H., Nickel, M., and Le, M. Flow matching for generative modeling. In11th International Conference on Learning Representations, ICLR 2023,

2023

[5] [5]

Playing Atari with Deep Reinforcement Learning

Mnih, V ., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing atari with deep reinforcement learning.arXiv preprint arXiv:1312.5602,

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning

Peng, X. B., Kumar, A., Zhang, G., and Levine, S. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning.arXiv preprint arXiv:1910.00177,

work page internal anchor Pith review Pith/arXiv arXiv 1910

[7] [7]

Probabilis- tic mixture-of-experts for efficient deep reinforcement learning.arXiv preprint arXiv:2104.09122,

Ren, J., Li, Y ., Ding, Z., Pan, W., and Dong, H. Probabilis- tic mixture-of-experts for efficient deep reinforcement learning.arXiv preprint arXiv:2104.09122,

work page arXiv

[8] [8]

Dual rl: Unification and new methods for reinforcement and imi- tation learning

Sikchi, H., Zheng, Q., Zhang, A., and Niekum, S. Dual rl: Unification and new methods for reinforcement and imi- tation learning. InInternational Conference on Learning Representations, volume 2024, pp. 9305–9352,

2024

[9] [9]

Score-Based Generative Modeling through Stochastic Differential Equations

Song, Y ., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Er- mon, S., and Poole, B. Score-based generative modeling through stochastic differential equations.arXiv preprint arXiv:2011.13456,

work page internal anchor Pith review Pith/arXiv arXiv 2011

[10] [10]

Behavior Regularized Offline Reinforcement Learning

Wu, Y ., Tucker, G., and Nachum, O. Behavior regu- larized offline reinforcement learning.arXiv preprint arXiv:1911.11361,

work page internal anchor Pith review Pith/arXiv arXiv 1911

[11] [11]

uni-modal tasks

Zhang, Q., Liu, Z., Fan, H., Liu, G., Zeng, B., and Liu, S. Flowpolicy: Enabling fast and robust 3d flow-based policy via consistency flow matching for robot manipula- tion. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pp. 14754–14762, 2025a. Zhang, S., Zhang, W., and Gu, Q. Energy-weighted flow matching for offline reinforc...

work page arXiv

[12] [12]

Theorems and Derivations A.1

11 Moment Matching Q-Learning A. Theorems and Derivations A.1. Maximum Mean Discrepancy To facilitate a further discussion of MMD, it is necessary to first introduce several key mathematical definitions. Definition A.1(Integral Probability Metric).A metric D(·,·) between two probability measures µ, ν is called anintegral probability metric(IPM) if it sati...

2017

[13] [13]

Given MMD loss function we defined in 3, we have: LD(θ) =E s,t Ext[k(f θ s,t(xt),·)]−E xr[k(f θ− s,r (xr),·)] 2 H =E s,t Ext,xr[k(f θ s,t(xt),·)−k(f θ− s,r (xr),·)] 2 H (reuse the samplex t) =E s,t hD Ext,xr[k(f θ s,t(xt),·)−k(f θ− s,r (xr),·)],E x′ t,x′r[k(f θ s,t(x′ t),·)−k(f θ− s,r (x′ r),·)] E H i =E s,t h Ext,xr,x′ t,x′r h k(f θ s,t(xt),·), k(f θ s,t...

2012