arxiv: 2603.04333 · v2 · submitted 2026-03-04 · 💻 cs.LG · cs.AI

What Does Flow Matching Bring To TD Learning?

Bhavya Agrawalla, Michal Nauman, Aviral Kumar

Pith reviewed 2026-05-15 16:29 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords: flow matching, temporal difference learning, reinforcement learning, value function estimation, plasticity, critic networks, online RL, test-time recovery

The pith

Flow matching improves TD learning not by modeling return distributions but by using integration to recover from early value errors and dense velocity supervision to keep network features plastic.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines why flow-matching approaches succeed at estimating scalar Q-values in reinforcement learning when standard critics struggle. It attributes the gains to two specific mechanisms: reading out values through iterative integration, which damps mistakes made in early steps, and training the velocity field at many points along each trajectory, which pushes the network to keep its internal features adaptable instead of locking onto single TD targets. According to the paper, these effects produce substantially stronger performance than monolithic critics in online RL settings where targets change rapidly and plasticity is easily lost. A reader would care because the result reframes a practical trick as a principled way to make value estimation more robust without altering the surrounding RL algorithm.

Core claim

Flow-matching critics succeed because integration for value readout enables test-time recovery that corrects errors in early estimates, while dense velocity supervision at multiple interpolants induces plastic feature representations that accommodate non-stationary TD targets without discarding prior learning or overfitting to individual targets. This stands in contrast to standard monolithic critics and to distributional RL formulations, both of which lack these mechanisms and therefore underperform in the same high-update-to-data regimes.

What carries the argument

The flow-matching critic that computes values via integration of a learned velocity field and receives dense supervision on that velocity field at many points along each integration path.
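
As a rough illustration of the two ingredients named above, the following Python sketch shows an Euler-integration readout and a multi-point velocity loss on a toy scalar problem. It assumes the common linear interpolant z_t = (1 - t) z_0 + t y with velocity target y - z_0; this page does not state the paper's exact interpolant, noise distribution, or network. The names `interpolant`, `velocity_target`, `dense_velocity_loss`, `integrate_value`, and `ideal_velocity` are hypothetical, and the closed-form field stands in for a learned network that would also be conditioned on state and action.

```python
import random


def interpolant(z0, y, t):
    """Linear interpolant between a noise sample z0 and a TD target y (assumed form)."""
    return (1.0 - t) * z0 + t * y


def velocity_target(z0, y):
    """Regression target for the velocity under the linear interpolant."""
    return y - z0


def dense_velocity_loss(velocity_fn, y, num_points=8, seed=0):
    """Mean squared velocity error over several (z0, t) draws for one target y."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_points):
        z0 = rng.uniform(-1.0, 1.0)  # hypothetical noise range
        t = rng.random()             # t drawn from [0, 1)
        zt = interpolant(z0, y, t)
        total += (velocity_fn(zt, t) - velocity_target(z0, y)) ** 2
    return total / num_points


def integrate_value(velocity_fn, z0, num_steps=8):
    """Read out a value by Euler-integrating the velocity field from t=0 to t=1."""
    z = z0
    dt = 1.0 / num_steps
    for k in range(num_steps):
        z += dt * velocity_fn(z, k * dt)
    return z


def ideal_velocity(y):
    """Closed-form field for a single fixed target y; stands in for a trained network."""
    return lambda z, t: (y - z) / (1.0 - t)


target = 2.5
field = ideal_velocity(target)
loss = dense_velocity_loss(field, target)            # supervision at many (z_t, t) points
value = integrate_value(field, z0=0.3, num_steps=8)  # integration-based readout
```

In this toy case the field is already optimal for the single target, so the loss is numerically zero and the readout lands on the target. A trained critic would instead minimize the same kind of loss against changing TD targets.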

If this is right

Flow-matching critics achieve roughly twice the final performance and five times the sample efficiency of monolithic critics in online RL problems that stress loss of plasticity.
Learning remains stable even when the number of gradient steps per environment step is large.
The approach avoids the performance drop that occurs when return distributions are modeled explicitly instead of using scalar integration.
The same mechanisms allow critics to represent changing TD targets without catastrophic forgetting of earlier features.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same integration-plus-dense-supervision pattern could be grafted onto other value-based methods that currently rely on monolithic critics.
Plasticity benefits may extend to non-stationary settings outside online RL, such as continual learning or meta-RL.
Direct measurement of feature drift during training could confirm whether velocity supervision is the primary driver of adaptability.

Load-bearing premise

The observed performance gains are produced specifically by test-time recovery through integration and by plastic feature learning induced by multi-point velocity supervision rather than by incidental details of the flow-matching implementation.

What would settle it

An ablation in which a standard critic is given the same integration-based readout but is trained with only single-point supervision shows no comparable gains in final performance or sample efficiency on high-UTD online RL benchmarks.

Original abstract

Recent work shows that flow matching can be effective for scalar Q-value function estimation in reinforcement learning (RL), but it remains unclear why or how this approach differs from standard critics. Contrary to conventional belief, we show that their success is not explained by distributional RL, as explicitly modeling return distributions can reduce performance. Instead, we argue that the use of integration for reading out values and dense velocity supervision at each step of this integration process for training improves TD learning via two mechanisms. First, it enables robust value prediction through test-time recovery, whereby iterative computation through integration dampens errors in early value estimates as more integration steps are performed. This recovery mechanism is absent in monolithic critics. Second, supervising the velocity field at multiple interpolant values induces more plastic feature learning within the network, allowing critics to represent non-stationary TD targets without discarding previously learned features or overfitting to individual TD targets encountered during training. We formalize these effects and validate them empirically, showing that flow-matching critics substantially outperform monolithic critics (2× in final performance and around 5× in sample efficiency) in settings where loss of plasticity poses a challenge, e.g., in high-UTD online RL problems, while remaining stable during learning.
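
The test-time recovery claim can be pictured with a toy numerical experiment. The Python sketch below assumes a hypothetical scalar velocity field whose slope in z is -c / (1 - t), so that it pulls any state toward a fixed target. It is an illustration chosen for this page, not the paper's critic or its analysis. Starting from a deliberately wrong initial estimate, it measures how the readout error changes with the number of Euler steps.

```python
def contracting_velocity(z, t, target, c):
    """Toy field with slope dv/dz = -c / (1 - t); a hypothetical stand-in for a critic."""
    return c * (target - z) / (1.0 - t)


def readout_error(initial_error, num_steps, target=1.0, c=0.5):
    """Euler-integrate from a wrong starting estimate and return the final absolute error."""
    z = target + initial_error
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        z += dt * contracting_velocity(z, t, target, c)
    return abs(z - target)


# Same initial mistake, increasing numbers of integration steps.
errors = {k: readout_error(1.0, k) for k in (1, 4, 16, 64)}
for k, err in errors.items():
    print(f"steps={k:3d}  remaining error={err:.4f}")
```

In this toy setting the remaining error after K steps equals the initial error times the product of (1 - c/m) for m = 1, ..., K, which shrinks as K grows. A critic that predicts the value in a single forward pass has no such step count to increase at inference time.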

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Flow matching helps TD critics via integration-based error recovery and dense velocity supervision rather than distributional modeling, but the causal link still needs tighter controls.

The paper's core claim is that flow-matching critics beat standard ones in high-UTD online RL because integration at readout damps early errors and because supervising velocity across interpolants keeps features plastic under shifting TD targets. They also show that forcing a distributional critic actually hurts, which is a useful negative result. Those two mechanisms are the new part; prior flow-matching RL work did not isolate them this way or test them against plasticity loss specifically.

The empirical side is straightforward: 2x final performance and roughly 5x sample efficiency in the regimes where monolithic critics degrade, with the gains holding across the reported settings. That is concrete and worth attention for anyone tuning online RL agents.

The main soft spot is the missing control the stress-test flags. The experiments compare full flow-matching critics to ordinary monolithic ones, but do not check whether a standard critic given the same multi-step integration readout at test time or the same dense supervision on interpolated points would close most of the gap. Without that, the performance delta could trace to optimization details, network parameterization, or objective shape instead of the two mechanisms. The formalization of the effects is light but internally consistent with the story they tell.

This is for readers working on critic stability in online RL or on function approximation that must track non-stationary targets. It is not a general theory of flow matching in RL, but it gives a practical recipe and a mechanistic hypothesis that can be tested further. I would send it to peer review; the question is real, the evidence is directionally clear, and the gaps are fixable with targeted ablations rather than foundational.

Referee Report

2 major / 2 minor

Summary. The paper claims that flow matching for scalar Q-value estimation in RL outperforms standard monolithic critics not because it is distributional RL (a negative result is reported), but due to two mechanisms: (1) integration-based readout at test time that enables 'test-time recovery' by iteratively damping early errors, and (2) dense velocity supervision across interpolants that induces more plastic feature learning, allowing better handling of non-stationary TD targets. These are said to yield 2× final performance and ~5× sample efficiency gains in high-UTD online RL settings where loss of plasticity is an issue.

Significance. If the mechanisms are isolated and the empirical gains hold under controlled conditions, the work would provide a useful mechanistic account of why flow-matching critics are more robust than monolithic ones and could guide the design of value approximators that maintain plasticity without distributional overhead.

major comments (2)

[Empirical results (Section 4)] The central attribution to test-time recovery and plastic feature learning is not isolated. The experiments compare flow-matching critics only against standard monolithic critics; no control applies multi-step integration readout at inference or dense supervision on interpolated targets to a monolithic critic. Without this ablation, the reported 2×/5× gains cannot be confidently attributed to the two proposed mechanisms rather than other differences in parameterization, optimization, or objective curvature.
[Section 3 and experiments] The negative result on distributional RL (that explicitly modeling return distributions can reduce performance) is presented as evidence against a distributional explanation, but the paper does not report whether the flow-matching formulation was compared against a distributional critic that also uses integration readout and dense supervision. This leaves the contrast incomplete.

minor comments (2)

[Preliminaries] Notation for the velocity field and interpolant schedule should be introduced earlier and used consistently when describing the two mechanisms.
[Abstract] The abstract states that the effects are 'formalized'; the main text should explicitly point to the section or appendix containing the formalization.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments correctly identify that our current experiments do not fully isolate the proposed mechanisms from other differences in parameterization. We address each point below and will revise the manuscript with additional controls and comparisons.

Referee: [Empirical results (Section 4)] The central attribution to test-time recovery and plastic feature learning is not isolated. The experiments compare flow-matching critics only against standard monolithic critics; no control applies multi-step integration readout at inference or dense supervision on interpolated targets to a monolithic critic. Without this ablation, the reported 2×/5× gains cannot be confidently attributed to the two proposed mechanisms rather than other differences in parameterization, optimization, or objective curvature.

Authors: We agree that the attribution would be stronger with explicit controls that apply multi-step integration readout and dense supervision on interpolated targets to a monolithic critic. In the revised manuscript we will add these ablations: (1) a monolithic critic trained with an auxiliary loss encouraging consistent predictions across interpolated states, and (2) test-time iterative refinement of the monolithic output. We note that dense velocity supervision is native to the flow-matching objective and cannot be exactly replicated without changing the model class, but the new controls will help quantify how much of the gain is due to the readout and supervision mechanisms versus other factors.

Revision: yes

Referee: [Section 3 and experiments] The negative result on distributional RL (that explicitly modeling return distributions can reduce performance) is presented as evidence against a distributional explanation, but the paper does not report whether the flow-matching formulation was compared against a distributional critic that also uses integration readout and dense supervision. This leaves the contrast incomplete.

Authors: The reported negative result shows that a standard distributional critic underperforms flow matching, indicating the gains are not explained by distributional modeling alone. We acknowledge that the contrast would be more complete if we also evaluated a distributional critic equipped with integration readout and dense supervision. In the revision we will add this comparison (subject to computational feasibility) to directly address whether the mechanisms provide benefits beyond distributional critics.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical mechanisms validated independently

The paper proposes two mechanisms (test-time recovery via integration readout and plastic feature learning via dense velocity supervision) to explain flow-matching advantages over monolithic critics in TD learning. These are formalized conceptually and supported by direct empirical comparisons showing 2x final performance and 5x sample efficiency gains in high-UTD regimes, plus explicit tests ruling out distributional RL as the cause. No derivation step reduces a claimed result to a fitted parameter, self-defined quantity, or self-citation chain by construction; the performance deltas are measured outcomes rather than tautological outputs of the inputs. The work remains self-contained against external benchmarks through controlled experimentation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard RL assumptions plus the two proposed mechanisms being the primary drivers of observed gains.

axioms (1)

domain assumption: Standard TD learning update rules and value function approximation hold in the evaluated environments.
Invoked when comparing flow-matching critics to monolithic ones.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Cost/FunctionalEquation washburn_uniqueness_aczel; dAlembert_to_ODE_general · echoes

echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

test-time recovery... iterative computation through integration dampens errors... β_K ∝ K^{-c'}... c-conic condition on velocity field: ∂v_θ*/∂z ≤ -c/(1-t)

Foundation/ArithmeticFromLogic embed_strictMono_of_one_lt; LogicNat.induction · echoes

flow-matching can adapt by reweighting existing features... β_t(m) = α_t ∏ (1 + α_k v_k(m))... even when feature directions u_t(m) remain fixed

Foundation/RealityFromDistinction reality_from_one_distinction · echoes

dense supervision... induces more plastic feature learning... without discarding previously learned features
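
The c-conic condition quoted in the first passage above suggests a simple continuous-time reading of test-time recovery. The derivation below is a sketch under assumptions not confirmed by this page: z is scalar, the velocity field (written v here instead of v_θ*) is differentiable in z, and both trajectories follow the exact ODE with no discretization. It is not the paper's proof.

```latex
% Two trajectories of the same field, started from different estimates.
\[
\frac{dz}{dt} = v(z, t), \qquad
\frac{d\tilde z}{dt} = v(\tilde z, t), \qquad
e(t) = \tilde z(t) - z(t).
\]

% Mean value theorem in z, for some \xi_t between z(t) and \tilde z(t).
\[
\frac{de}{dt} = \frac{\partial v}{\partial z}(\xi_t, t)\, e(t).
\]

% Apply the conic condition \partial v / \partial z \le -c / (1 - t).
\[
\frac{d}{dt}\, e(t)^2
= 2\, \frac{\partial v}{\partial z}(\xi_t, t)\, e(t)^2
\le -\frac{2c}{1 - t}\, e(t)^2.
\]

% Gronwall's inequality, for 0 \le s \le t < 1.
\[
|e(t)| \le |e(s)| \left( \frac{1 - t}{1 - s} \right)^{c}.
\]
```

Under these assumptions an error present at time s is scaled by at most ((1 - t)/(1 - s))^c by time t, so its bound goes to zero as t approaches 1.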

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Quantile-Coupled Flow Matching for Distributional Reinforcement Learning
cs.LG 2026-05 conditional novelty 7.0

FlowIQN is a quantile-coupled CFM critic that yields the first explicit Wasserstein-aligned approximate projection for distributional RL, with improved return-distribution accuracy and competitive offline RL performance.

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · cited by 1 Pith paper · 10 internal anchors

[1]

floq: Training Critics via Flow-Matching for Scaling Compute in Value-Based RL, 2025

Bhavya Agrawalla, Michal Nauman, Khush Agrawal, and Aviral Kumar. floq: Training Critics via Flow-Matching for Scaling Compute in Value-Based RL, 2025. URLhttps://arxiv.org/abs/ 2509.06863. 16 What Does Flow Matching Bring To TD Learning?

work page arXiv 2025
[2]

Building normalizing flows with stochastic interpolants

Michael S Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants. InInternational Conference on Learning Representations (ICLR), 2023

work page 2023
[3]

On the Optimization of Deep Networks: Implicit Acceleration by Overparameterization

Sanjeev Arora, Nadav Cohen, and Elad Hazan. On the optimization of deep networks: Implicit acceleration by overparameterization.arXiv preprint arXiv:1802.06509, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[4]

Jimmy Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization.ArXiv, abs/1607.06450, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[5]

Efficient online reinforcement learning with offline data

Philip J Ball, Laura Smith, Ilya Kostrikov, and Sergey Levine. Efficient online reinforcement learning with offline data. InInternational Conference on Machine Learning, pages 1577–1594. PMLR, 2023

work page 2023
[6]

A distributional perspective on reinforcement learning

Marc G Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. InProceedings of the 34th International Conference on Machine Learning-Volume 70, pages 449–458. JMLR. org, 2017

work page 2017
[7]

On the closed- form of flow matching: Generalization does not arise from target stochasticity.arXiv preprint arXiv:2506.03719, 2025

Quentin Bertrand, Anne Gagneux, Mathurin Massias, and Rémi Emonet. On the closed-form of flow matching: Generalization does not arise from target stochasticity, 2025. URLhttps: //arxiv.org/abs/2506.03719

work page arXiv 2025
[8]

Dime: Diffusion-based maximum entropy reinforcement learning.arXiv preprint arXiv:2502.02316, 2025

Onur Celik, Zechu Li, Denis Blessing, Ge Li, Daniel Palenicek, Jan Peters, Georgia Chalvatzaki, and Gerhard Neumann. Dime: Diffusion-based maximum entropy reinforcement learning.arXiv preprint arXiv:2502.02316, 2025

work page arXiv 2025
[9]

Unleashing flow policies with distributional critics.arXiv preprint arXiv:2509.23087, 2025

Deshu Chen, Yuchen Liu, Zhijian Zhou, Chao Qu, and Yuan Qi. Unleashing flow policies with distributional critics.arXiv preprint arXiv:2509.23087, 2025

work page arXiv 2025
[10]

Xinyue Chen, Che Wang, Zijian Zhou, and Keith W. Ross. Randomized ensembled double q-learning: Learning fast without a model. InInternational Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=AY8zfZm0tDd

work page 2021
[11]

Diffusion policy: Visuomotor policy learning via action diffusion

Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. InRobotics: Science and Systems (RSS), 2023

work page 2023
[12]

Distributional Reinforcement Learning with Quantile Regression

Will Dabney, Mark Rowland, Marc G Bellemare, and Rémi Munos. Distributional reinforcement learning with quantile regression.arXiv preprint arXiv:1710.10044, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[13]

Implicit Quantile Networks for Distributional Reinforcement Learning

Will Dabney, Georg Ostrovski, David Silver, and Rémi Munos. Implicit quantile networks for distributional reinforcement learning.arXiv preprint arXiv:1806.06923, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[14]

The value-improvement path: Towards better representations for reinforcement learning.arXiv preprint arXiv:2006.02243, 2020

Will Dabney, André Barreto, Mark Rowland, Robert Dadashi, John Quan, Marc G Bellemare, and David Silver. The value-improvement path: Towards better representations for reinforcement learning.arXiv preprint arXiv:2006.02243, 2020

work page arXiv 2006
[15]

Value flows,

Perry Dong, Chongyi Zheng, Chelsea Finn, Dorsa Sadigh, and Benjamin Eysenbach. Value flows,

work page
[16]

URLhttps://arxiv.org/abs/2510.07650

work page arXiv
[17]

Tql: Scaling q-functions with transformers by preventing attention collapse.arXiv preprint arXiv:2602.01439, 2026

Perry Dong, Kuo-Han Hung, Alexander Swerdlow, Dorsa Sadigh, and Chelsea Finn. Tql: Scaling q-functions with transformers by preventing attention collapse.arXiv preprint arXiv:2602.01439, 2026. 17 What Does Flow Matching Bring To TD Learning?

work page arXiv 2026
[18]

Gradient Descent Provably Optimizes Over-parameterized Neural Networks

Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks.arXiv preprint arXiv:1810.02054, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[19]

Expressive value learning for scalable offline reinforcement learning.arXiv Preprint, 2025

Nicolas Espinosa-Dice, Kiante Brantley, and Wen Sun. Expressive value learning for scalable offline reinforcement learning.arXiv Preprint, 2025

work page 2025
[20]

Scaling offline rl via efficient and expressive shortcut models.arXiv preprint arXiv:2505.22866, 2025

Nicolas Espinosa-Dice, Yiyi Zhang, Yiding Chen, Bradley Guo, Owen Oertell, Gokul Swamy, Kiante Brantley, and Wen Sun. Scaling offline rl via efficient and expressive shortcut models.arXiv preprint arXiv:2505.22866, 2025

work page arXiv 2025
[21]

Stop regressing: Training value functions via clas- sification for scalable deep rl.arXiv preprint arXiv:2403.03950, 2024

Jesse Farebrother, Jordi Orbay, Quan Vuong, Adrien Ali Taïga, Yevgen Chebotar, Ted Xiao, Alex Irpan, Sergey Levine, Pablo Samuel Castro, Aleksandra Faust, et al. Stop regressing: Training value functions via classification for scalable deep rl.arXiv preprint arXiv:2403.03950, 2024

work page arXiv 2024
[22]

D4RL: Datasets for Deep Data-Driven Reinforcement Learning

Justin Fu, Aviral Kumar, Ofir Nachum, G. Tucker, and Sergey Levine. D4rl: Datasets for deep data-driven reinforcement learning.ArXiv, abs/2004.07219, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2004
[23]

Addressing function approximation error in actor-critic methods

Scott Fujimoto, Herke van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. InInternational Conference on Machine Learning (ICML), pages 1587–1596, 2018

work page 2018
[24]

Double q-learning

Hado van Hasselt. Double q-learning. InProceedings of the 23rd International Conference on Neural Information Processing Systems - Volume 2, 2010

work page 2010
[25]

Deep Residual Learning for Image Recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning For Image Recognition.arXiv preprint arXiv:1512.03385, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[26]

Denoising diffusion probabilistic models.Advances in Neural Information Processing Systems, 33:6840–6851, 2020

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in Neural Information Processing Systems, 33:6840–6851, 2020

work page 2020
[27]

Dissecting deep rl with high update ratios: Combatting value divergence.arXiv preprint arXiv:2403.05996, 2024

Marcel Hussing, Claas Voelcker, Igor Gilitschenski, Amir-massoud Farahmand, and Eric Eaton. Dissecting deep rl with high update ratios: Combatting value divergence.arXiv preprint arXiv:2403.05996, 2024

work page arXiv 2024
[28]

Implicit under-parameterization inhibits data-efficient deep reinforcement learning

Aviral Kumar, Rishabh Agarwal, Dibya Ghosh, and Sergey Levine. Implicit under-parameterization inhibits data-efficient deep reinforcement learning. InInternational Conference on Learning Repre- sentations, 2021

work page 2021
[29]

DR3: Value-based deep reinforcement learning requires explicit regularization.International Conference on Learning Representations, 2022

Aviral Kumar, Rishabh Agarwal, Tengyu Ma, Aaron Courville, George Tucker, and Sergey Levine. DR3: Value-based deep reinforcement learning requires explicit regularization.International Conference on Learning Representations, 2022

work page 2022
[30]

Offline Q- learning on diverse multi-task data both scales and generalizes

Aviral Kumar, Rishabh Agarwal, Xinyang Geng, George Tucker, and Sergey Levine. Offline Q- learning on diverse multi-task data both scales and generalizes. InInternational Conference on Learning Representations, 2023

work page 2023
[31]

Hyperspherical normalization for scalable deep reinforcement learning.arXiv preprint arXiv:2502.15280, 2025

HojoonLee,YoungdoLee,TakumaSeno,DonghuKim,PeterStone,andJaegulChoo. Hyperspherical normalization for scalable deep reinforcement learning.arXiv preprint arXiv:2502.15280, 2025

work page arXiv 2025
[32]

Offline reinforcement learning: Tutorial, review, and perspectives on open problems.arXiv preprint, 2020

Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems.arXiv preprint, 2020. 18 What Does Flow Matching Bring To TD Learning?

work page 2020
[33]

Back to Basics: Let Denoising Generative Models Denoise

Tianhong Li and Kaiming He. Back to basics: Let denoising generative models denoise.arXiv preprint arXiv:2511.13720, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[34]

Flow matching for generative modeling

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. InInternational Conference on Learning Representations (ICLR), 2023

work page 2023
[35]

Flow-based policy for online reinforcement learning.arXiv preprint arXiv:2506.12811, 2025

Lei Lv, Yunfei Li, Yu Luo, Fuchun Sun, Tao Kong, Jiafeng Xu, and Xiao Ma. Flow-based policy for online reinforcement learning.arXiv preprint arXiv:2506.12811, 2025

work page arXiv 2025
[36]

Learning dynamics and generalization in deep reinforcement learning

Clare Lyle, Mark Rowland, Will Dabney, Marta Kwiatkowska, and Yarin Gal. Learning dynamics and generalization in deep reinforcement learning. InInternational Conference on Machine Learning, pages 14560–14581. PMLR, 2022

work page 2022
[37]

Understanding plasticity in neural networks

Clare Lyle, Zeyu Zheng, Evgenii Nikishin, Bernardo Avila Pires, Razvan Pascanu, and Will Dabney. Understanding plasticity in neural networks. InInternational Conference on Machine Learning, 2023

work page 2023
[38]

Normalization and effective learning rates in reinforcement learning.Advances in Neural Information Processing Systems, 37:106440–106473, 2024

Clare Lyle, Zeyu Zheng, Khimya Khetarpal, James Martens, Hado P van Hasselt, Razvan Pascanu, and Will Dabney. Normalization and effective learning rates in reinforcement learning.Advances in Neural Information Processing Systems, 37:106440–106473, 2024

work page 2024
[39]

Efficient online reinforcement learning for diffusion policy.arXiv preprint arXiv:2502.00361,

Haitong Ma, Tianyi Chen, Kai Wang, Na Li, and Bo Dai. Efficient online reinforcement learning for diffusion policy.arXiv preprint arXiv:2502.00361, 2025

work page arXiv 2025
[40]

Cal-ql: Calibrated offline rl pre-training for efficient online fine-tuning

Mitsuhiko Nakamoto, Simon Zhai, Anikait Singh, Max Sobol Mark, Yi Ma, Chelsea Finn, Aviral Kumar, and Sergey Levine. Cal-ql: Calibrated offline rl pre-training for efficient online fine-tuning. Advances in Neural Information Processing Systems, 36, 2024

work page 2024
[41]

Overestimation, overfitting, and plasticity in actor-critic: The bitter lesson of reinforcement learning

Michal Nauman, Michał Bortkiewicz, Piotr Miłoś, Tomasz Trzcinski, Mateusz Ostaszewski, and Marek Cygan. Overestimation, overfitting, and plasticity in actor-critic: The bitter lesson of reinforcement learning. InInternational Conference on Machine Learning, 2024

work page 2024
[42]

Bigger, regularized, optimistic: Scaling for compute and sample-efficient continuous control.Advances in Neural Information Processing Systems, 2024

Michal Nauman, Mateusz Ostaszewski, Krzysztof Jankowski, Piotr Miłoś, and Marek Cygan. Bigger, regularized, optimistic: Scaling for compute and sample-efficient continuous control.Advances in Neural Information Processing Systems, 2024

work page 2024
[43]

Bigger, regularized, categorical: High-capacity value functions are efficient multi-task learners.arXiv preprint arXiv:2505.23150, 2025

Michal Nauman, Marek Cygan, Carmelo Sferrazza, Aviral Kumar, and Pieter Abbeel. Bigger, regularized, categorical: High-capacity value functions are efficient multi-task learners.arXiv preprint arXiv:2505.23150, 2025

work page arXiv 2025
[44]

The primacy bias in deep reinforcement learning

Evgenii Nikishin, Max Schwarzer, Pierluca D’Oro, Pierre-Luc Bacon, and Aaron Courville. The primacy bias in deep reinforcement learning. InInternational conference on machine learning, pages 16828–16847. PMLR, 2022

work page 2022
[45]

Deep reinforcement learning with plasticity injection.Advances in Neural Information Processing Systems, 36:37142–37159, 2023

Evgenii Nikishin, Junhyuk Oh, Georg Ostrovski, Clare Lyle, Razvan Pascanu, Will Dabney, and André Barreto. Deep reinforcement learning with plasticity injection.Advances in Neural Information Processing Systems, 36:37142–37159, 2023

work page 2023
[46]

Xqc: Well-conditioned optimization accelerates deep reinforcement learning.arXiv preprint arXiv:2509.25174, 2025

Daniel Palenicek, Florian Vogt, Joe Watson, Ingmar Posner, and Jan Peters. Xqc: Well-conditioned optimization accelerates deep reinforcement learning.arXiv preprint arXiv:2509.25174, 2025. 19 What Does Flow Matching Bring To TD Learning?

work page arXiv 2025
[47]

Much ado about noising: Dispelling the myths of generative robotic control.arXiv preprint arXiv:2512.01809,

Chaoyi Pan, Giri Anantharaman, Nai-Chieh Huang, Claire Jin, Daniel Pfrommer, Chenyang Yuan, Frank Permenter, Guannan Qu, Nicholas Boffi, Guanya Shi, and Max Simchowitz. Much ado about noising: Dispelling the myths of generative robotic control. 2025. URLhttps://arxiv.org/ abs/2512.01809

work page arXiv 2025
[48]

Is value learning really the main bottleneck in offline rl?arXiv preprint arXiv:2406.09329, 2024

Seohong Park, Kevin Frans, Sergey Levine, and Aviral Kumar. Is value learning really the main bottleneck in offline rl?arXiv preprint arXiv:2406.09329, 2024

work page arXiv 2024
[49]

Ogbench: Benchmarking offline goal-conditioned rl

Seohong Park, Kevin Frans, Benjamin Eysenbach, and Sergey Levine. Ogbench: Benchmarking offline goal-conditioned rl. InInternational Conference on Learning Representations (ICLR), 2025

work page 2025
[50]

Flow q-learning.arXiv preprint arXiv:2502.02538,

Seohong Park, Qiyang Li, and Sergey Levine. Flow q-learning.arXiv:2502.02538, 2025

work page arXiv 2025
[51]

D5rl: Diverse datasets for data-driven deep reinforcement learning

Rafael Rafailov, Kyle Beltran Hatch, Anikait Singh, Aviral Kumar, Laura Smith, Ilya Kostrikov, Philippe Hansen-Estruch, Victor Kolev, Philip J Ball, Jiajun Wu, et al. D5rl: Diverse datasets for data-driven deep reinforcement learning. InReinforcement Learning Conference (RLC), 2024

[52] Allen Z Ren, Justin Lidard, Lars L Ankile, Anthony Simeonov, Pulkit Agrawal, Anirudha Majumdar, Benjamin Burchfiel, Hongkai Dai, and Max Simchowitz. Diffusion policy policy optimization. arXiv preprint arXiv:2409.00588, 2024.

[53] Amrith Setlur, Yuxiao Qu, Matthew Yang, Lunjun Zhang, Virginia Smith, and Aviral Kumar. Optimizing LLM Test-Time Compute Involves Solving a Meta-RL Problem. https://blog.ml.cmu.edu/2025/01/08/optimizing-llm-test-time-compute-involves-solving-a-meta-rl-problem/, 2025. CMU MLD Blog.

[54] Amrith Setlur, Nived Rajaraman, Sergey Levine, and Aviral Kumar. Scaling test-time compute without verification or rl is suboptimal, 2025. URL https://arxiv.org/abs/2502.12118

[55] Amrith Setlur, Matthew Y. R. Yang, Charlie Snell, Jeremy Greer, Ian Wu, Virginia Smith, Max Simchowitz, and Aviral Kumar. e3: Learning to explore enables extrapolation of test-time compute for llms, 2025. URL https://arxiv.org/abs/2506.09026

[56] Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314, 2024.

[57] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML, 2015.

[58] Denis Tarasov, Vladislav Kurenkov, Alexander Nikulin, and Sergey Kolesnikov. Revisiting the minimalist approach to offline reinforcement learning. In Neural Information Processing Systems (NeurIPS), 2023.

[59] Andrew Wagenmaker, Mitsuhiko Nakamoto, Yunchu Zhang, Seohong Park, Waleed Yagoub, Anusha Nagabandi, Abhishek Gupta, and Sergey Levine. Steering your diffusion policy with latent space reinforcement learning. arXiv preprint arXiv:2506.15799, 2025.

[60] Zhendong Wang, Jonathan J Hunt, and Mingyuan Zhou. Diffusion policies as an expressive policy class for offline reinforcement learning.

[61] Shan Zhong, Shutong Ding, He Diao, Xiangyu Wang, Kah Chan Teh, and Bei Peng. Flowcritic: Bridging value estimation with flow matching in reinforcement learning. arXiv preprint arXiv:2510.22686, 2025.

Appendices

A. Additional Experimental Results

Post-layernorm feature norms for flow-matching critics (floq) vs monolithic cr...

the geometry of the exceptional set where contraction fails, and

the discrete Euler trajectory induced by the learned flow. A more refined analysis could therefore proceed by: (1) showing that, with high probability over initialization $z \sim \mathrm{Unif}[l, u]$, the induced trajectory spends only a small fraction of its steps in regions where the conic inequality fails, and (2) controlling the cumulative effect of these rare expansion...

1. (Flow matching.) For the flow-matching predictor, $\dot{w}_{\mathrm{eff}}(m) = \sum_{\ell=1}^{T-2} \dot{\beta}_\ell(m)\, u_\ell$, so the predictor can evolve entirely via the dynamics of the gain parameter $\{\dot{v}_k(m)\}$ (Lemma E.3).
2. (Monolithic predictor.) For $f_{\mathrm{mono}}(\mathbf{x}; m) = w(m)^\top \mathbf{x}$ trained by a squared loss, $\dot{w}(m) = -2(\Sigma w(m) - b(m))$. Thus changing any predictions to chase a new target requires $\dot{w}(m) \neq 0$. When $w(m)$ is...
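The monolithic-predictor dynamics can be checked numerically. The sketch below is illustrative only and is not taken from the paper: it assumes that Σ is the empirical second-moment matrix of the inputs and that b is the empirical input-target correlation (the excerpt does not define either), and all function and variable names are hypothetical. It compares the stated velocity −2(Σw − b) against a finite-difference gradient of the mean squared loss, then integrates the flow with Euler steps.

```python
import numpy as np

def squared_loss(w, X, y):
    """Mean squared error of the linear predictor f(x) = w^T x."""
    return np.mean((X @ w - y) ** 2)

def gradient_flow_velocity(w, Sigma, b):
    """Gradient-flow right-hand side: dw/dt = -2 (Sigma w - b)."""
    return -2.0 * (Sigma @ w - b)

def numerical_gradient(w, X, y, eps=1e-6):
    """Central finite-difference gradient of the squared loss."""
    grad = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step[i] = eps
        grad[i] = (squared_loss(w + step, X, y)
                   - squared_loss(w - step, X, y)) / (2 * eps)
    return grad

# Synthetic regression data (assumption: any fixed target works here).
rng = np.random.default_rng(0)
X = rng.normal(size=(512, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.1 * rng.normal(size=512)

# Assumed definitions: Sigma = E[x x^T], b = E[x y], both empirical.
Sigma = X.T @ X / len(X)
b = X.T @ y / len(X)
w = rng.normal(size=4)

# The flow velocity should equal the negative loss gradient.
velocity = gradient_flow_velocity(w, Sigma, b)
max_gap = np.max(np.abs(velocity + numerical_gradient(w, X, y)))

# Euler-discretized gradient flow from the random initialization.
w_t = w.copy()
for _ in range(2000):
    w_t = w_t + 0.01 * gradient_flow_velocity(w_t, Sigma, b)

# Stationary point of the flow, where Sigma w = b.
w_star = np.linalg.solve(Sigma, b)
```

Under these assumptions the weights settle where Σw = b. A new target changes b and therefore moves that stationary point, so the monolithic predictor can only follow it by changing w itself, which is the condition $\dot{w}(m) \neq 0$ stated in item 2.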