What Does Flow Matching Bring To TD Learning?
Recognition: 3 theorem links
Pith reviewed 2026-05-15 16:29 UTC · model grok-4.3
The pith
Flow matching improves TD learning not by modeling return distributions but by using integration to recover from early value errors and dense velocity supervision to keep network features plastic.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Flow-matching critics succeed because integration for value readout enables test-time recovery that corrects errors in early estimates, while dense velocity supervision at multiple interpolants induces plastic feature representations that accommodate non-stationary TD targets without discarding prior learning or overfitting to individual targets. This stands in contrast to standard monolithic critics and to distributional RL formulations, both of which lack these mechanisms and therefore underperform in the same high-update-to-data regimes.
What carries the argument
The flow-matching critic that computes values via integration of a learned velocity field and receives dense supervision on that velocity field at many points along each integration path.
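The integration readout can be illustrated with a minimal sketch. It assumes a scalar value state `z`, a fixed-step Euler integrator, and a hand-written toy velocity field (`ideal_velocity`, the velocity of a straight-line interpolant toward a known target) standing in for the learned network. The function names, the step count, and the injected perturbation are illustrative assumptions, not the paper's implementation.

```python
def ideal_velocity(z, t, q_target):
    """Toy stand-in for a learned field: velocity of z_t = (1 - t) * z_0 + t * q_target."""
    return (q_target - z) / (1.0 - t)


def integrate_value(z0, velocity_fn, num_steps, perturb_step=None, perturb=0.0):
    """Euler-integrate dz/dt = velocity_fn(z, t) from t = 0 to t = 1 and return z(1)."""
    z = float(z0)
    h = 1.0 / num_steps
    for k in range(num_steps):
        t = k * h
        z = z + h * velocity_fn(z, t)
        if perturb_step is not None and k == perturb_step:
            z += perturb  # hypothetical error injected into an early estimate
    return z


q = 3.0  # assumed scalar value the flow should read out
field = lambda z, t: ideal_velocity(z, t, q)

clean = integrate_value(-1.0, field, num_steps=8)
noisy = integrate_value(-1.0, field, num_steps=8, perturb_step=1, perturb=5.0)
print(clean, noisy)
```

With this idealized field, the remaining integration steps pull the perturbed trajectory back to the target, which is the intuition behind test-time recovery. A learned, imperfect field would be expected to recover only partially.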
If this is right
- Flow-matching critics achieve roughly twice the final performance and five times the sample efficiency of monolithic critics in online RL problems that stress loss of plasticity.
- Learning remains stable even when the number of gradient steps per environment step is large.
- The approach avoids the performance drop that occurs when return distributions are modeled explicitly instead of using scalar integration.
- The same mechanisms allow critics to represent changing TD targets without catastrophic forgetting of earlier features.
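The "gradient steps per environment step" in the list above is the update-to-data (UTD) ratio. A minimal sketch of a high-UTD loop follows. The names `run_high_utd`, `collect_fn`, and `update_fn` are hypothetical placeholders for an environment step and a critic update; nothing here is taken from the paper's code.

```python
import random


def run_high_utd(num_env_steps, utd_ratio, batch_size, collect_fn, update_fn, seed=0):
    """Collect one transition per environment step, then run `utd_ratio` updates on replayed data."""
    rng = random.Random(seed)
    replay_buffer = []
    num_updates = 0
    for step in range(num_env_steps):
        replay_buffer.append(collect_fn(step))      # one new transition per env step
        for _ in range(utd_ratio):                  # many gradient steps on the same data
            batch = [rng.choice(replay_buffer) for _ in range(batch_size)]
            update_fn(batch)
            num_updates += 1
    return num_updates


batches = []
total_updates = run_high_utd(
    num_env_steps=10, utd_ratio=8, batch_size=4,
    collect_fn=lambda step: step,   # placeholder transition
    update_fn=batches.append,       # placeholder critic update
)
print(total_updates)
```

Because each transition is replayed many times while the TD targets keep moving, this regime is where loss of plasticity tends to show up.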
Where Pith is reading between the lines
- The same integration-plus-dense-supervision pattern could be grafted onto other value-based methods that currently rely on monolithic critics.
- Plasticity benefits may extend to non-stationary settings outside online RL, such as continual learning or meta-RL.
- Direct measurement of feature drift during training could confirm whether velocity supervision is the primary driver of adaptability.
Load-bearing premise
The observed performance gains are produced specifically by test-time recovery through integration and by plastic feature learning induced by multi-point velocity supervision rather than by incidental details of the flow-matching implementation.
What would settle it
An ablation in which a standard critic is given the same integration-based readout but is trained with only single-point supervision shows no comparable gains in final performance or sample efficiency on high-UTD online RL benchmarks.
Original abstract
Recent work shows that flow matching can be effective for scalar Q-value function estimation in reinforcement learning (RL), but it remains unclear why or how this approach differs from standard critics. Contrary to conventional belief, we show that their success is not explained by distributional RL, as explicitly modeling return distributions can reduce performance. Instead, we argue that the use of integration for reading out values and dense velocity supervision at each step of this integration process for training improves TD learning via two mechanisms. First, it enables robust value prediction through test-time recovery, whereby iterative computation through integration dampens errors in early value estimates as more integration steps are performed. This recovery mechanism is absent in monolithic critics. Second, supervising the velocity field at multiple interpolant values induces more plastic feature learning within the network, allowing critics to represent non-stationary TD targets without discarding previously learned features or overfitting to individual TD targets encountered during training. We formalize these effects and validate them empirically, showing that flow-matching critics substantially outperform monolithic critics (2× in final performance and around 5× in sample efficiency) in settings where loss of plasticity poses a challenge, e.g., in high-UTD online RL problems, while remaining stable during learning.
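Dense velocity supervision at multiple interpolants can be sketched as a loss function. The sketch assumes the standard linear-interpolant flow-matching objective, uniformly sampled start points in an assumed range `[low, high]`, and a `velocity_fn(z_t, t)` callable that stands in for the critic network (a real critic would also condition on state-action features). These choices are assumptions for illustration and may differ from the paper's exact objective.

```python
import numpy as np


def velocity_matching_loss(velocity_fn, td_targets, low, high, num_interpolants, rng):
    """Mean squared velocity error, averaged over several interpolants per TD target."""
    y = np.asarray(td_targets, dtype=float)[:, None]   # shape (batch, 1)
    shape = (y.shape[0], num_interpolants)
    z0 = rng.uniform(low, high, size=shape)            # assumed uniform start points
    t = rng.uniform(0.0, 0.99, size=shape)             # keep away from t = 1
    z_t = (1.0 - t) * z0 + t * y                       # straight-line interpolant
    target_velocity = y - z0                           # time derivative of z_t
    pred = velocity_fn(z_t, t)
    return float(np.mean((pred - target_velocity) ** 2))


td = np.array([1.0, -2.0, 0.5])                        # hypothetical TD targets
ideal = lambda z, t: (td[:, None] - z) / (1.0 - t)     # exact field for these targets
zero = lambda z, t: np.zeros_like(z)                   # untrained baseline

loss_ideal = velocity_matching_loss(ideal, td, -5.0, 5.0, 16, np.random.default_rng(0))
loss_zero = velocity_matching_loss(zero, td, -5.0, 5.0, 16, np.random.default_rng(0))
print(loss_ideal, loss_zero)
```

Each TD target contributes `num_interpolants` supervised points along its integration path instead of a single regression target, which is the sense in which the supervision is dense.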
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that flow matching for scalar Q-value estimation in RL outperforms standard monolithic critics not because it is distributional RL (a negative result is reported), but due to two mechanisms: (1) integration-based readout at test time that enables 'test-time recovery' by iteratively damping early errors, and (2) dense velocity supervision across interpolants that induces more plastic feature learning, allowing better handling of non-stationary TD targets. These are said to yield 2× final performance and ~5× sample efficiency gains in high-UTD online RL settings where loss of plasticity is an issue.
Significance. If the mechanisms are isolated and the empirical gains hold under controlled conditions, the work would provide a useful mechanistic account of why flow-matching critics are more robust than monolithic ones and could guide the design of value approximators that maintain plasticity without distributional overhead.
major comments (2)
- [Empirical results (Section 4)] The central attribution to test-time recovery and plastic feature learning is not isolated. The experiments compare flow-matching critics only against standard monolithic critics; no control applies multi-step integration readout at inference or dense supervision on interpolated targets to a monolithic critic. Without this ablation, the reported 2×/5× gains cannot be confidently attributed to the two proposed mechanisms rather than other differences in parameterization, optimization, or objective curvature.
- [Section 3 and experiments] The negative result on distributional RL (that explicitly modeling return distributions can reduce performance) is presented as evidence against a distributional explanation, but the paper does not report whether the flow-matching formulation was compared against a distributional critic that also uses integration readout and dense supervision. This leaves the contrast incomplete.
minor comments (2)
- [Preliminaries] Notation for the velocity field and interpolant schedule should be introduced earlier and used consistently when describing the two mechanisms.
- [Abstract] The abstract states that the effects are 'formalized'; the main text should explicitly point to the section or appendix containing the formalization.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The comments correctly identify that our current experiments do not fully isolate the proposed mechanisms from other differences in parameterization. We address each point below and will revise the manuscript with additional controls and comparisons.
- Referee: [Empirical results (Section 4)] The central attribution to test-time recovery and plastic feature learning is not isolated. The experiments compare flow-matching critics only against standard monolithic critics; no control applies multi-step integration readout at inference or dense supervision on interpolated targets to a monolithic critic. Without this ablation, the reported 2×/5× gains cannot be confidently attributed to the two proposed mechanisms rather than other differences in parameterization, optimization, or objective curvature.
  Authors: We agree that the attribution would be stronger with explicit controls that apply multi-step integration readout and dense supervision on interpolated targets to a monolithic critic. In the revised manuscript we will add these ablations: (1) a monolithic critic trained with an auxiliary loss encouraging consistent predictions across interpolated states, and (2) test-time iterative refinement of the monolithic output. We note that dense velocity supervision is native to the flow-matching objective and cannot be exactly replicated without changing the model class, but the new controls will help quantify how much of the gain is due to the readout and supervision mechanisms versus other factors.
  Revision: yes
- Referee: [Section 3 and experiments] The negative result on distributional RL (that explicitly modeling return distributions can reduce performance) is presented as evidence against a distributional explanation, but the paper does not report whether the flow-matching formulation was compared against a distributional critic that also uses integration readout and dense supervision. This leaves the contrast incomplete.
  Authors: The reported negative result shows that a standard distributional critic underperforms flow matching, indicating the gains are not explained by distributional modeling alone. We acknowledge that the contrast would be more complete if we also evaluated a distributional critic equipped with integration readout and dense supervision. In the revision we will add this comparison (subject to computational feasibility) to directly address whether the mechanisms provide benefits beyond distributional critics.
  Revision: yes
Circularity Check
No significant circularity; empirical mechanisms validated independently
The paper proposes two mechanisms (test-time recovery via integration readout and plastic feature learning via dense velocity supervision) to explain flow-matching advantages over monolithic critics in TD learning. These are formalized conceptually and supported by direct empirical comparisons showing 2× final performance and around 5× sample efficiency gains in high-UTD regimes, plus explicit tests ruling out distributional RL as the cause. No derivation step reduces a claimed result to a fitted parameter, self-defined quantity, or self-citation chain by construction; the performance deltas are measured outcomes rather than tautological outputs of the inputs. The work remains self-contained against external benchmarks through controlled experimentation.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: standard TD learning update rules and value function approximation hold in the evaluated environments.
Lean theorems connected to this paper
- Cost/FunctionalEquation: washburn_uniqueness_aczel; dAlembert_to_ODE_general (echoes)
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Passage: test-time recovery... iterative computation through integration dampens errors... β_K ∝ K^{-c'}... c-conic condition on velocity field: ∂v_θ*/∂z ≤ -c/(1-t)
- Foundation/ArithmeticFromLogic: embed_strictMono_of_one_lt; LogicNat.induction (echoes)
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Passage: flow-matching can adapt by reweighting existing features... β_t(m) = α_t ∏ (1 + α_k v_k(m))... even when feature directions u_t(m) remain fixed
- Foundation/RealityFromDistinction: reality_from_one_distinction (echoes)
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Passage: dense supervision... induces more plastic feature learning... without discarding previously learned features
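The conic condition quoted in the first entry above suggests a short continuous-time argument for why integration damps early errors. The following is a sketch under stated assumptions: exact integration rather than discrete Euler steps, a differentiable scalar velocity field $v$, and the symbols $z$, $z'$, $e$, $\xi_t$ introduced here for illustration. It is not the paper's own proof and does not reproduce its discrete-step rate.

```latex
% Two trajectories of the same flow, started from different initial values:
\[
  \dot z(t) = v(z(t), t), \qquad \dot z'(t) = v(z'(t), t), \qquad e(t) = z(t) - z'(t).
\]
% Mean value theorem in z (\xi_t lies between z(t) and z'(t)),
% then the conic condition \partial v / \partial z \le -c/(1-t):
\[
  \dot e(t) = \frac{\partial v}{\partial z}(\xi_t, t)\, e(t)
  \quad\Longrightarrow\quad
  \frac{d}{dt}\,\lvert e(t)\rvert \le -\frac{c}{1-t}\,\lvert e(t)\rvert .
\]
% Gronwall's inequality, using \int_0^t \frac{c}{1-s}\,ds = -c \ln(1-t):
\[
  \lvert e(t)\rvert \le \lvert e(0)\rvert\,(1-t)^{c}, \qquad 0 \le t < 1 .
\]
```

Under these assumptions, any discrepancy in the early part of the trajectory shrinks polynomially in $(1-t)$ as integration proceeds, with a larger $c$ giving faster contraction.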
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Quantile-Coupled Flow Matching for Distributional Reinforcement Learning: FlowIQN is a quantile-coupled CFM critic that yields the first explicit Wasserstein-aligned approximate projection for distributional RL, with improved return-distribution accuracy and competitive offline RL performance.