pith. machine review for the scientific record.

arxiv: 2603.04333 · v2 · submitted 2026-03-04 · 💻 cs.LG · cs.AI

Recognition: 3 theorem links

What Does Flow Matching Bring To TD Learning?

Pith reviewed 2026-05-15 16:29 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords flow matching · temporal difference learning · reinforcement learning · value function estimation · plasticity · critic networks · online RL · test-time recovery

The pith

Flow matching improves TD learning not by modeling return distributions but by using integration to recover from early value errors and dense velocity supervision to keep network features plastic.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines why flow-matching approaches succeed at estimating scalar Q-values in reinforcement learning when standard critics struggle. It demonstrates that the gains arise from two specific mechanisms: reading out values through iterative integration, which damps down mistakes made in early steps, and training the velocity field at many points along each trajectory, which forces the network to maintain adaptable internal features instead of locking onto single TD targets. These effects produce substantially stronger performance than monolithic critics in online RL settings where targets change rapidly and plasticity is easily lost. A reader would care because the result reframes a practical trick as a principled way to make value estimation more robust without altering the surrounding RL algorithm.

Core claim

Flow-matching critics succeed because integration for value readout enables test-time recovery that corrects errors in early estimates, while dense velocity supervision at multiple interpolants induces plastic feature representations that accommodate non-stationary TD targets without discarding prior learning or overfitting to individual targets. This stands in contrast to standard monolithic critics and to distributional RL formulations, both of which lack these mechanisms and therefore underperform in the same high-update-to-data regimes.
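
As a minimal sketch of test-time recovery (not the paper's implementation), the toy below replaces the learned velocity field with the closed-form field of a straight-line interpolant toward a known scalar target, v(z, t) = (target - z) / (1 - t), and integrates it with forward Euler. The function names, the target value, and the step count are illustrative assumptions. Under this idealized field the error after k of K steps shrinks by the factor (K - k) / K, so two very different initial guesses end at the same value.

```python
import numpy as np


def ideal_velocity(z, t, target):
    # Assumption: closed-form velocity of the straight line z_t = (1 - t) * z0 + t * target.
    # A real flow-matching critic would use a learned network here instead.
    return (target - z) / (1.0 - t)


def euler_readout(z0, target, num_steps):
    """Integrate dz/dt = v(z, t) from t = 0 to t = 1 with forward Euler.

    Returns every iterate so the error can be inspected step by step.
    """
    z = float(z0)
    trajectory = [z]
    h = 1.0 / num_steps
    for k in range(num_steps):
        t = k * h  # the last evaluation is at t = 1 - h, so 1 - t never hits zero
        z = z + h * ideal_velocity(z, t, target)
        trajectory.append(z)
    return np.array(trajectory)


target = 3.0
traj_a = euler_readout(z0=-10.0, target=target, num_steps=8)
traj_b = euler_readout(z0=25.0, target=target, num_steps=8)

# Absolute error of each iterate; it never grows from one step to the next.
errors_a = np.abs(traj_a - target)
errors_b = np.abs(traj_b - target)
```

A monolithic critic has no comparable sequence of iterates to refine: its single forward pass is the final estimate.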

What carries the argument

The flow-matching critic that computes values via integration of a learned velocity field and receives dense supervision on that velocity field at many points along each integration path.
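
The dense-supervision half of this mechanism can be sketched as a Monte Carlo loss, under stated assumptions: a linear interpolant z_t = (1 - t) * z0 + t * y between a uniformly sampled start z0 and a scalar TD target y, with the velocity regressed onto y - z0. The function names, the sampling range for z0, and the `context` argument (a stand-in for state-action features) are hypothetical; the paper's exact objective may differ.

```python
import numpy as np


def dense_velocity_loss(velocity_fn, contexts, td_targets, rng,
                        points_per_target=4, z_low=-1.0, z_high=1.0):
    """Squared-error velocity loss evaluated at several interpolant points per target.

    Each TD target y is paired with `points_per_target` samples (t, z_t) on the line
    z_t = (1 - t) * z0 + t * y, so one target supervises the field at many locations.
    """
    y = np.repeat(np.asarray(td_targets, dtype=float), points_per_target)
    ctx = np.repeat(np.asarray(contexts, dtype=float), points_per_target, axis=0)
    z0 = rng.uniform(z_low, z_high, size=y.shape)  # assumed initial-value distribution
    t = rng.uniform(0.0, 1.0, size=y.shape)        # interpolation times in [0, 1)
    z_t = (1.0 - t) * z0 + t * y
    target_velocity = y - z0                       # time derivative of the linear interpolant
    predicted = velocity_fn(z_t, t, ctx)
    return float(np.mean((predicted - target_velocity) ** 2))


def oracle_velocity(z_t, t, ctx):
    # Toy oracle: the context *is* the target, so the exact field is available in closed form.
    return (ctx - z_t) / (1.0 - t)


def zero_velocity(z_t, t, ctx):
    # Baseline that never moves the estimate.
    return np.zeros_like(z_t)


rng = np.random.default_rng(0)
targets = np.array([2.0, -0.5, 4.0])
oracle_loss = dense_velocity_loss(oracle_velocity, targets, targets, rng)
baseline_loss = dense_velocity_loss(zero_velocity, targets, targets, rng)
```

In this toy the oracle field drives the loss to numerical zero, while the zero field does not. Setting `points_per_target=1` with a fixed `t` would correspond to the single-point supervision that the page contrasts with dense supervision.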

If this is right

  • Flow-matching critics achieve roughly twice the final performance and five times the sample efficiency of monolithic critics in online RL problems that stress loss of plasticity.
  • Learning remains stable even when the number of gradient steps per environment step is large.
  • The approach avoids the performance drop that occurs when return distributions are modeled explicitly instead of using scalar integration.
  • The same mechanisms allow critics to represent changing TD targets without catastrophic forgetting of earlier features.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same integration-plus-dense-supervision pattern could be grafted onto other value-based methods that currently rely on monolithic critics.
  • Plasticity benefits may extend to non-stationary settings outside online RL, such as continual learning or meta-RL.
  • Direct measurement of feature drift during training could confirm whether velocity supervision is the primary driver of adaptability.

Load-bearing premise

The observed performance gains are produced specifically by test-time recovery through integration and by plastic feature learning induced by multi-point velocity supervision rather than by incidental details of the flow-matching implementation.

What would settle it

An ablation in which a standard critic is given the same integration-based readout but is trained with only single-point supervision shows no comparable gains in final performance or sample efficiency on high-UTD online RL benchmarks.
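
One way to lay out such a test, sketched here with hypothetical arm labels that do not come from the paper, is a 2×2 factorial grid over readout and supervision. The arm described above is the one that pairs integration readout with single-point supervision.

```python
from itertools import product

# Hypothetical labels for the two factors; the paper does not name its arms this way.
READOUTS = ("single_pass", "integration")
SUPERVISION = ("single_point", "dense_interpolants")


def ablation_grid():
    """Enumerate every combination of readout and supervision as a config dict."""
    return [{"readout": r, "supervision": s} for r, s in product(READOUTS, SUPERVISION)]


grid = ablation_grid()

# The arm singled out in the text: integration readout, but only single-point supervision.
decisive_arm = {"readout": "integration", "supervision": "single_point"}

# Reference arms: a standard monolithic critic and a full flow-matching critic.
monolithic_arm = {"readout": "single_pass", "supervision": "single_point"}
flow_matching_arm = {"readout": "integration", "supervision": "dense_interpolants"}
```

Running all four arms on the same benchmark and update-to-data ratio would separate the contribution of the readout from that of the supervision.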

Original abstract

Recent work shows that flow matching can be effective for scalar Q-value function estimation in reinforcement learning (RL), but it remains unclear why or how this approach differs from standard critics. Contrary to conventional belief, we show that their success is not explained by distributional RL, as explicitly modeling return distributions can reduce performance. Instead, we argue that the use of integration for reading out values and dense velocity supervision at each step of this integration process for training improves TD learning via two mechanisms. First, it enables robust value prediction through test-time recovery, whereby iterative computation through integration dampens errors in early value estimates as more integration steps are performed. This recovery mechanism is absent in monolithic critics. Second, supervising the velocity field at multiple interpolant values induces more plastic feature learning within the network, allowing critics to represent non-stationary TD targets without discarding previously learned features or overfitting to individual TD targets encountered during training. We formalize these effects and validate them empirically, showing that flow-matching critics substantially outperform monolithic critics (2× in final performance and around 5× in sample efficiency) in settings where loss of plasticity poses a challenge, e.g., in high-UTD online RL problems, while remaining stable during learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that flow matching for scalar Q-value estimation in RL outperforms standard monolithic critics not because it is distributional RL (a negative result is reported), but due to two mechanisms: (1) integration-based readout at test time that enables 'test-time recovery' by iteratively damping early errors, and (2) dense velocity supervision across interpolants that induces more plastic feature learning, allowing better handling of non-stationary TD targets. These are said to yield 2× final performance and ~5× sample efficiency gains in high-UTD online RL settings where loss of plasticity is an issue.

Significance. If the mechanisms are isolated and the empirical gains hold under controlled conditions, the work would provide a useful mechanistic account of why flow-matching critics are more robust than monolithic ones and could guide the design of value approximators that maintain plasticity without distributional overhead.

major comments (2)
  1. [Empirical results (Section 4)] The central attribution to test-time recovery and plastic feature learning is not isolated. The experiments compare flow-matching critics only against standard monolithic critics; no control applies multi-step integration readout at inference or dense supervision on interpolated targets to a monolithic critic. Without this ablation, the reported 2×/5× gains cannot be confidently attributed to the two proposed mechanisms rather than other differences in parameterization, optimization, or objective curvature.
  2. [Section 3 and experiments] The negative result on distributional RL (that explicitly modeling return distributions can reduce performance) is presented as evidence against a distributional explanation, but the paper does not report whether the flow-matching formulation was compared against a distributional critic that also uses integration readout and dense supervision. This leaves the contrast incomplete.
minor comments (2)
  1. [Preliminaries] Notation for the velocity field and interpolant schedule should be introduced earlier and used consistently when describing the two mechanisms.
  2. [Abstract] The abstract states that the effects are 'formalized'; the main text should explicitly point to the section or appendix containing the formalization.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments correctly identify that our current experiments do not fully isolate the proposed mechanisms from other differences in parameterization. We address each point below and will revise the manuscript with additional controls and comparisons.

Point-by-point responses
  1. Referee: [Empirical results (Section 4)] The central attribution to test-time recovery and plastic feature learning is not isolated. The experiments compare flow-matching critics only against standard monolithic critics; no control applies multi-step integration readout at inference or dense supervision on interpolated targets to a monolithic critic. Without this ablation, the reported 2×/5× gains cannot be confidently attributed to the two proposed mechanisms rather than other differences in parameterization, optimization, or objective curvature.

    Authors: We agree that the attribution would be stronger with explicit controls that apply multi-step integration readout and dense supervision on interpolated targets to a monolithic critic. In the revised manuscript we will add these ablations: (1) a monolithic critic trained with an auxiliary loss encouraging consistent predictions across interpolated states, and (2) test-time iterative refinement of the monolithic output. We note that dense velocity supervision is native to the flow-matching objective and cannot be exactly replicated without changing the model class, but the new controls will help quantify how much of the gain is due to the readout and supervision mechanisms versus other factors. revision: yes

  2. Referee: [Section 3 and experiments] The negative result on distributional RL (that explicitly modeling return distributions can reduce performance) is presented as evidence against a distributional explanation, but the paper does not report whether the flow-matching formulation was compared against a distributional critic that also uses integration readout and dense supervision. This leaves the contrast incomplete.

    Authors: The reported negative result shows that a standard distributional critic underperforms flow matching, indicating the gains are not explained by distributional modeling alone. We acknowledge that the contrast would be more complete if we also evaluated a distributional critic equipped with integration readout and dense supervision. In the revision we will add this comparison (subject to computational feasibility) to directly address whether the mechanisms provide benefits beyond distributional critics. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical mechanisms validated independently

Full rationale

The paper proposes two mechanisms (test-time recovery via integration readout and plastic feature learning via dense velocity supervision) to explain flow-matching advantages over monolithic critics in TD learning. These are formalized conceptually and supported by direct empirical comparisons showing 2× final performance and roughly 5× sample efficiency gains in high-UTD regimes, plus explicit tests ruling out distributional RL as the cause. No derivation step reduces a claimed result to a fitted parameter, self-defined quantity, or self-citation chain by construction; the performance deltas are measured outcomes rather than tautological outputs of the inputs. The work remains self-contained against external benchmarks through controlled experimentation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard RL assumptions plus the two proposed mechanisms being the primary drivers of observed gains.

axioms (1)
  • domain assumption: Standard TD learning update rules and value function approximation hold in the evaluated environments.
    Invoked when comparing flow-matching critics to monolithic ones.

pith-pipeline@v0.9.0 · 5521 in / 1104 out tokens · 36387 ms · 2026-05-15T16:29:13.975300+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost/FunctionalEquation washburn_uniqueness_aczel; dAlembert_to_ODE_general echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    test-time recovery... iterative computation through integration dampens errors... β_K ∝ K^{-c'}... c-conic condition on velocity field: ∂v_θ*/∂z ≤ -c/(1-t)

  • Foundation/ArithmeticFromLogic embed_strictMono_of_one_lt; LogicNat.induction echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    flow-matching can adapt by reweighting existing features... β_t(m) = α_t ∏ (1 + α_k v_k(m))... even when feature directions u_t(m) remain fixed

  • Foundation/RealityFromDistinction reality_from_one_distinction echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    dense supervision... induces more plastic feature learning... without discarding previously learned features
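
The c-conic condition quoted in the first link can be connected to error damping with a short continuous-time argument. This is a sketch under assumptions not stated on this page: z is scalar, the velocity field is differentiable in z, and the inequality holds everywhere along both trajectories. It is not the paper's proof, and it does not derive the discrete-step rate β_K ∝ K^{-c'} quoted above.

```latex
% Two readouts started from different initial values, following the same field:
%   \dot z^{(1)} = v(z^{(1)}, t), \qquad \dot z^{(2)} = v(z^{(2)}, t), \qquad
%   \delta(t) = z^{(1)}(t) - z^{(2)}(t).
\begin{align*}
\frac{d}{dt}\,\lvert \delta(t) \rvert
  &= \operatorname{sign}(\delta)\,\bigl[v(z^{(1)}, t) - v(z^{(2)}, t)\bigr] \\
  &= \frac{\partial v}{\partial z}(\xi, t)\,\lvert \delta(t) \rvert
     && \text{(mean value theorem, $\xi$ between $z^{(1)}$ and $z^{(2)}$)} \\
  &\le -\frac{c}{1 - t}\,\lvert \delta(t) \rvert
     && \text{(conic condition)} \\[4pt]
\Longrightarrow\quad
\lvert \delta(t) \rvert
  &\le \lvert \delta(0) \rvert \exp\!\Bigl(-c \int_0^t \frac{ds}{1 - s}\Bigr)
   = \lvert \delta(0) \rvert\,(1 - t)^{c}.
\end{align*}
```

Under these assumptions, a discrepancy in the initial value shrinks polynomially in (1 - t) and vanishes as t approaches 1, which is the continuous-time counterpart of the recovery behaviour described in the quoted passage.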

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Quantile-Coupled Flow Matching for Distributional Reinforcement Learning

    cs.LG 2026-05 conditional novelty 7.0

    FlowIQN is a quantile-coupled CFM critic that yields the first explicit Wasserstein-aligned approximate projection for distributional RL, with improved return-distribution accuracy and competitive offline RL performance.
