pith. machine review for the scientific record.

arxiv: 2604.19730 · v1 · submitted 2026-04-21 · 💻 cs.LG · cs.AI

Recognition: unknown

FASTER: Value-Guided Sampling for Fast RL

Alexander Swerdlow, Chelsea Finn, Dorsa Sadigh, Perry Dong

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 02:43 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords reinforcement learning · diffusion policies · value-guided sampling · test-time scaling · manipulation tasks · sampling efficiency

The pith

FASTER models denoising of action candidates as an MDP so a learned value function can filter poor samples early and cut compute in diffusion RL.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FASTER to capture the performance gains of sampling many actions in diffusion-based RL policies while avoiding the full cost of denoising every candidate to completion. It does this by recasting the entire process of generating and selecting among multiple partially denoised actions as a Markov Decision Process whose states live in the denoising space. A policy and value function trained inside that MDP then decide which candidates to keep and which to discard at each step, guided by predicted downstream returns. Sympathetic readers would care because current high-performing generative RL methods become impractical for robotics and other real-time uses once they rely on repeated full sampling at test time. If the approach works, it supplies a drop-in way to keep the benefits of test-time scaling without redesigning the underlying policy or training loop.
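To make the mechanism concrete, here is a minimal sketch of value-guided early filtering during denoising. It is an illustration under assumptions, not the authors' implementation: `denoise_step`, `q_dn`, the action dimensionality, and the single filtering point are stand-ins for the learned components described in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
ACTION_DIM = 7      # illustrative action dimensionality
NUM_STEPS = 10      # denoising steps T .. 1

def denoise_step(env_state, actions, t):
    # Stand-in for one reverse-diffusion step of the base policy.
    return actions + 0.1 * rng.standard_normal(actions.shape)

def q_dn(env_state, t, actions):
    # Stand-in for the learned denoise critic: predicted downstream return per candidate.
    return -np.linalg.norm(actions, axis=-1)

def value_guided_sampling(env_state, num_candidates=8, keep=1, filter_step=NUM_STEPS):
    """Score candidates early (here at the initial noise), keep the top `keep`,
    and finish denoising only the survivors instead of all N (best-of-N)."""
    actions = rng.standard_normal((num_candidates, ACTION_DIM))  # initial noise samples
    for t in range(NUM_STEPS, 0, -1):
        if t == filter_step:                       # early filtering decision
            values = q_dn(env_state, t, actions)
            survivors = np.argsort(values)[-keep:]
            actions = actions[survivors]
        actions = denoise_step(env_state, actions, t)
    return actions[0]                              # with keep=1, one fully denoised action remains

print(value_guided_sampling(env_state=None).shape)   # (7,)
```

The compute saving comes entirely from the fact that only the surviving candidates pass through the remaining denoising steps.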

Core claim

FASTER treats the denoising of multiple action candidates together with the selection of the best one as a Markov Decision Process defined directly in the space of partially denoised actions. A value function learned in this MDP predicts the eventual return of each candidate from its current denoising state and enables progressive filtering that discards low-value trajectories before they are fully denoised. The resulting lightweight module plugs into existing generative RL algorithms, improves policy performance on long-horizon manipulation tasks in both online and batch-online regimes, and matches the performance of a pretrained vision-language-action model while substantially lowering training and inference compute requirements.

What carries the argument

The denoising-space MDP that frames progressive filtering of action candidates according to their predicted downstream returns.
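As a reading aid, the state of that MDP (as defined in the paper's Figure 2) can be written down directly. This is a descriptive sketch with illustrative field names, not the authors' code.

```python
from dataclasses import dataclass
from typing import Dict, Set
import numpy as np

@dataclass
class FilteringState:
    """One state of the action-filtering MDP (paper, Figure 2):
    s_t = (s, t, C_t, {a_i^(t)} for i in C_t). Field names are illustrative."""
    env_state: np.ndarray                    # environment state s
    denoise_t: int                           # denoising timestep t in {T, ..., 1}
    survivors: Set[int]                      # surviving candidate set C_t ⊆ {1, ..., N}
    partial_actions: Dict[int, np.ndarray]   # partially denoised actions a_i^(t)

# An action of this MDP chooses which candidates to keep at step t; the learned
# value function scores such states by the return of the action eventually executed.
```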

If this is right

  • FASTER improves the underlying policies across challenging long-horizon manipulation tasks in both online and batch-online RL.
  • It achieves the best overall performance among the compared methods on those tasks.
  • When applied to a pretrained VLA it reaches the same final performance while reducing both training and inference compute.
  • The method can be inserted as a lightweight addition into existing generative RL algorithms without changing their training procedure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same early-filtering logic could be tested on other iterative generative processes used inside RL, such as autoregressive token models.
  • Dynamic adjustment of the candidate budget at each denoising step, rather than a fixed number, becomes feasible once value predictions are available (see the sketch after this list).
  • If the denoising-space value function transfers across tasks, it might reduce the need to retrain large policies from scratch when only inference efficiency is required.
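To illustrate the second point, one hypothetical budget rule (not proposed in the paper) would keep every candidate whose predicted value sits within a margin of the current best, so that confident value gaps prune aggressively while flat value landscapes keep more samples.

```python
import numpy as np

def dynamic_keep_count(predicted_values, min_keep=1, max_keep=8, margin=0.1):
    # Keep every candidate whose predicted value is within `margin` of the best,
    # clipped to [min_keep, max_keep]. Purely illustrative; the paper uses a fixed budget.
    best = predicted_values.max()
    keep = int((predicted_values >= best - margin).sum())
    return int(np.clip(keep, min_keep, max_keep))

print(dynamic_keep_count(np.array([0.9, 0.88, 0.3, 0.1])))  # -> 2
```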

Load-bearing premise

A value function trained inside the denoising-space MDP will rank partially denoised action candidates by their true eventual return without systematic bias introduced by early filtering decisions or by distribution shift between training and test-time trajectories.
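One way to probe this premise empirically is to compare the rank correlation between the critic's predicted values and the returns realized after full denoising and execution, computed separately on unfiltered rollouts (as seen in training) and on rollouts produced under the test-time filter. A minimal sketch, assuming such pairs have been logged (all arrays below are placeholders, not data from the paper):

```python
import numpy as np
from scipy.stats import spearmanr

def value_rank_fidelity(predicted_values, realized_returns):
    # Rank correlation between the denoise critic's predictions and the returns
    # actually obtained after full denoising and execution.
    rho, p = spearmanr(predicted_values, realized_returns)
    return rho, p

# Placeholder arrays; one set would come from unfiltered rollouts, one from filtered rollouts.
rho_unfiltered, _ = value_rank_fidelity(np.random.rand(500), np.random.rand(500))
rho_filtered, _ = value_rank_fidelity(np.random.rand(500), np.random.rand(500))
print(rho_unfiltered, rho_filtered)   # a large drop on the filtered set would signal bias
```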

What would settle it

An experiment in which FASTER is run on the same tasks and the early-filtered trajectories produce lower returns than fully denoising all candidates, or yield no net reduction in compute for equivalent final performance.

Figures

Figures reproduced from arXiv: 2604.19730 by Alexander Swerdlow, Chelsea Finn, Dorsa Sadigh, Perry Dong.

Figure 1
Figure 1. Left: Overview of FASTER. Instead of denoising all N candidates and selecting the best action post-hoc (best-of-N), FASTER learns a denoise critic Qdn that scores action samples during denoising, often directly on the initial noise itself. Right: Performance of FASTER compared with baseline averaged across tasks for π0.5. The FLOPs are normalized on the x-axis against the amount of FLOPs needed for the bas… view at source ↗
Figure 2
Figure 2. Action Filtering MDP. We model the process of denoising action candidates and selecting the best one as an MDP where the goal is to filter action samples during denoising while maximizing returns. States. A state s_t = (s, t, C_t, {a_i^(t)}_{i∈C_t}) consists of the environment state s, the denoising timestep t ∈ {T, …, 1}, the surviving candidate set C_t ⊆ {1, …, N}, and the partially denoised interm… view at source ↗
Figure 3
Figure 3. Top: Success rates of FASTER and baselines in the online settings. FASTER-EXPO outperforms strong baselines in sample efficiency. Bottom: Compute comparisons of FASTER-EXPO and EXPO. FASTER eliminates extra denoising during training and inference, yielding large FLOP reductions relative to EXPO with comparable task performance. view at source ↗
Figure 4
Figure 4. Success rate and compute comparisons of FASTER-IDQL and IDQL in the online setting. FASTER can be applied to IDQL to eliminate extra denoising rollouts at inference while obtaining the same performance in success rates. view at source ↗
Figure 5
Figure 5. Top: Success rate curves of FASTER-EXPO and EXPO in the batch-online setting. FASTER matches the performance of EXPO in iterations. Bottom: Compute comparisons of FASTER-EXPO and EXPO in the batch-online setting. As in the online setting, FASTER-EXPO yields a large FLOP reduction compared to EXPO from not needing to denoise all action samples. view at source ↗
Figure 6
Figure 6. Training and inference timing of FASTER-EXPO compared to EXPO. FASTER-EXPO achieves 1.7x improvement in inference time and 4.5x improvement in the update step time. view at source ↗
Figure 7
Figure 7. FASTER-EXPO compared to EXPO on top of π0.5. Top: Performance with environment steps. FASTER-EXPO is competitive in performance compared to EXPO. Bottom: Performance with FLOPs. FASTER-EXPO performs significantly better than EXPO under the same compute as FASTER-EXPO chooses the best action sample without denoising all sampled actions in inference and training. view at source ↗
Figure 8
Figure 8. Critic-size ablation for FASTER-EXPO on can and square. We compare filtering critics Qdn with parameter counts set to approximately 1.0×, 0.5×, and 0.25× that of Qa. We find that Qdn can be substantially smaller than Qa without degrading performance. view at source ↗
Figure 10
Figure 10. Learned-filter ablation results. Filtering at the initial seed performs comparably to learning the full filtering policy in the MDP. view at source ↗
Figure 11
Figure 11. Distillation ablation results. Distilling the value-maximizing distribution into a policy performs significantly worse compared to FASTER. view at source ↗
read the original abstract

Some of the most performant reinforcement learning algorithms today can be prohibitively expensive as they use test-time scaling methods such as sampling multiple action candidates and selecting the best one. In this work, we propose FASTER, a method for getting the benefits of sampling-based test-time scaling of diffusion-based policies without the computational cost by tracing the performance gain of action samples back to earlier in the denoising process. Our key insight is that we can model the denoising of multiple action candidates and selecting the best one as a Markov Decision Process (MDP) where the goal is to progressively filter action candidates before denoising is complete. With this MDP, we can learn a policy and value function in the denoising space that predicts the downstream value of action candidates in the denoising process and filters them while maximizing returns. The result is a method that is lightweight and can be plugged into existing generative RL algorithms. Across challenging long-horizon manipulation tasks in online and batch-online RL, FASTER consistently improves the underlying policies and achieves the best overall performance among the compared methods. Applied to a pretrained VLA, FASTER achieves the same performance while substantially reducing training and inference compute requirements. Code is available at https://github.com/alexanderswerdlow/faster .

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces FASTER, a method that reformulates the denoising process of multiple action candidates from a diffusion-based policy as a Markov Decision Process (MDP) in denoising space. A policy and value function are learned to progressively filter low-value candidates early in the denoising trajectory, with the value function predicting downstream returns to maximize performance while reducing the number of full denoising steps required. The approach is presented as a lightweight plug-in for existing generative RL algorithms. Empirical claims include consistent policy improvements and best-in-class performance on long-horizon manipulation tasks in both online and batch-online RL settings, plus equivalent performance to a pretrained VLA with substantially lower training and inference compute.

Significance. If the empirical claims hold under the distribution-shift concerns, FASTER would provide a practical mechanism for test-time scaling in diffusion policies without proportional compute cost, which is relevant for robotics and long-horizon control. Code availability supports reproducibility and potential follow-up work.

major comments (2)
  1. [Method (MDP formulation and value-function training)] The central claim rests on the value function accurately ranking partially denoised candidates by eventual return. However, because the filtering policy alters the distribution of denoising trajectories at test time relative to the (unfiltered or differently filtered) trajectories used to train the value function, systematic bias in value estimates is possible. This distribution-shift issue is load-bearing for the filtering decisions and is not addressed by additional analysis or targeted ablations in the experiments section.
  2. [Experiments] Table 1 and Figure 3 (performance comparisons): while the paper states that FASTER achieves the best overall performance, the reported gains are not accompanied by statistical significance tests or sufficient error analysis across seeds to confirm that improvements are not attributable to variance in the underlying base policies.
minor comments (2)
  1. [Abstract] The abstract would benefit from one or two key quantitative results (e.g., success-rate deltas or compute-reduction factors) to allow readers to assess the magnitude of the claimed improvements without reading the full experiments section.
  2. [Background / Method] Notation for the denoising-space state and action spaces is introduced without an explicit comparison table to the original MDP; a small diagram or table would improve clarity for readers unfamiliar with diffusion policies.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the two major comments point by point below, providing clarifications and indicating the revisions we will make to strengthen the paper.

read point-by-point responses
  1. Referee: [Method (MDP formulation and value-function training)] The central claim rests on the value function accurately ranking partially denoised candidates by eventual return. However, because the filtering policy alters the distribution of denoising trajectories at test time relative to the (unfiltered or differently filtered) trajectories used to train the value function, systematic bias in value estimates is possible. This distribution-shift issue is load-bearing for the filtering decisions and is not addressed by additional analysis or targeted ablations in the experiments section.

    Authors: We appreciate the referee highlighting this potential distribution shift. The value function is trained exclusively on trajectories sampled from the base diffusion policy (without filtering), which exposes it to a wide range of denoising paths and their associated returns. At test time the filtering policy uses these estimates to prune low-value candidates early; because the policy is trained to maximize the same return objective, the selected trajectories are biased toward high-value paths by design. Nevertheless, we agree that an explicit analysis of any resulting bias would improve the manuscript. In the revision we will add an ablation that measures the correlation between predicted values and realized returns on both filtered and unfiltered trajectory sets, together with a short discussion of observed discrepancies. These results will be placed in the experiments section. revision: partial

  2. Referee: [Experiments] Table 1 and Figure 3 (performance comparisons): while the paper states that FASTER achieves the best overall performance, the reported gains are not accompanied by statistical significance tests or sufficient error analysis across seeds to confirm that improvements are not attributable to variance in the underlying base policies.

    Authors: We agree that the current presentation would benefit from formal statistical analysis. While the reported numbers are already averages over multiple random seeds, we did not include error bars or significance tests. In the revised manuscript we will augment Table 1 and Figure 3 with standard-deviation error bars across seeds and add p-values from paired statistical tests (e.g., Wilcoxon signed-rank) comparing FASTER against each baseline. These additions will be described in the experimental protocol subsection. revision: yes
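A minimal sketch of the proposed paired test, assuming per-seed success rates are available for FASTER and a baseline (the numbers below are placeholders, not results from the paper):

```python
import numpy as np
from scipy.stats import wilcoxon

# Placeholder per-seed success rates (one entry per random seed).
faster_success   = np.array([0.82, 0.79, 0.85, 0.80, 0.83])
baseline_success = np.array([0.74, 0.77, 0.76, 0.72, 0.78])

# Paired test across seeds, as proposed in the rebuttal; a small p-value suggests the
# improvement is not explained by seed-to-seed variance alone.
stat, p_value = wilcoxon(faster_success, baseline_success)
print(f"Wilcoxon statistic={stat:.3f}, p={p_value:.4f}")
```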

Circularity Check

0 steps flagged

No circularity: standard RL value learning on explicitly defined denoising MDP

full rationale

The paper models denoising + selection as an MDP, then trains a value function on trajectories from that MDP to estimate downstream return for early filtering. This is a conventional critic-learning step whose training signal comes from environment returns, not from the filtering policy itself. No equation reduces the learned value to a tautology or renames a fitted parameter as a prediction; no self-citation supplies a uniqueness theorem or ansatz; the empirical claims rest on task performance measured after training, which is externally falsifiable. The derivation chain therefore remains self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method rests on one core modeling assumption and a learned component; no new physical entities are introduced.

free parameters (1)
  • Value function and policy parameters in denoising space
    These are fitted during training to predict downstream returns and are central to the filtering decisions.
axioms (1)
  • domain assumption The process of denoising multiple action candidates can be faithfully represented as an MDP whose states are partial trajectories and whose actions are keep/discard decisions.
    This modeling choice is stated directly in the abstract as the key insight enabling value-guided early stopping.

pith-pipeline@v0.9.0 · 5519 in / 1427 out tokens · 48276 ms · 2026-05-10T02:43:04.171380+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

47 extracted references · 43 canonical work pages · 13 internal anchors

  1. [1]

    Self-Consistency Improves Chain of Thought Reasoning in Language Models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V . Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models.CoRR, abs/2203.11171, 2022. doi: 10.48550/arXiv.2203.11171. URL https:// arxiv.org/abs/2203.11171

  2. [2]

    Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

    Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM test-time compute optimally can be more effective than scaling model parameters.CoRR, abs/2408.03314, 2024. doi: 10.48550/arXiv.2408.03314. URLhttps://arxiv.org/abs/2408.03314

  3. [3]

    Inference-aware fine-tuning for best-of-n sampling in large language models.arXiv preprint arXiv:2412.15287, 2024

    Yinlam Chow, Guy Tennenholtz, Izzeddin Gur, Vincent Zhuang, Bo Dai, Sridhar Thiagarajan, Craig Boutilier, Rishabh Agarwal, Aviral Kumar, and Aleksandra Faust. Inference-aware fine-tuning for best-of-N sampling in large language models.CoRR, abs/2412.15287, 2024. doi: 10.48550/arXiv.2412.15287. URLhttps://arxiv.org/abs/2412.15287

  4. [4]

    Inference-time scaling for diffusion models beyond scaling denoising steps

    Nanye Ma, Shangyuan Tong, Haolin Jia, Hexiang Hu, Yu-Chuan Su, Mingda Zhang, Xuan Yang, Yandong Li, Tommi Jaakkola, Xuhui Jia, and Saining Xie. Inference-time scaling for diffusion models beyond scaling denoising steps.CoRR, abs/2501.09732, 2025. doi: 10.48550/arXiv.2501.09732. URLhttps://arxiv.org/abs/2501.09732

  5. [5]

    EXPO: Stable Reinforcement Learning with Expressive Policies

    Perry Dong, Qiyang Li, Dorsa Sadigh, and Chelsea Finn. EXPO: Stable reinforcement learning with expressive policies.CoRR, abs/2507.07986, 2025. doi: 10.48550/arXiv.2507.07986. URL https://arxiv.org/abs/2507.07986

  6. [6]

    IDQL: Implicit Q-Learning as an Actor-Critic Method with Diffusion Policies

    Philippe Hansen-Estruch, Ilya Kostrikov, Michael Janner, Jakub Grudzien Kuba, and Sergey Levine. IDQL: Implicit q-learning as an actor-critic method with diffusion policies.CoRR, abs/2304.10573, 2023. doi: 10.48550/arXiv.2304.10573. URL https://arxiv.org/abs/ 2304.10573

  7. [7]

Value Flows

    Perry Dong, Chongyi Zheng, Chelsea Finn, Dorsa Sadigh, and Benjamin Eysenbach. Value flows, 2026. URLhttps://arxiv.org/abs/2510.07650

  8. [8]

    Offline reinforcement learning via high-fidelity generative behavior modeling

    Huayu Chen, Cheng Lu, Chengyang Ying, Hang Su, and Jun Zhu. Offline reinforcement learning via high-fidelity generative behavior modeling, 2023. URL https://arxiv.org/ abs/2209.14548

  9. [9]

    Diffusion policies as an expressive policy class for offline reinforcement learning.arXiv preprint arXiv:2208.06193,

    Zhendong Wang, Jonathan J. Hunt, and Mingyuan Zhou. Diffusion policies as an expressive policy class for offline reinforcement learning.CoRR, abs/2208.06193, 2022. doi: 10.48550/ arXiv.2208.06193. URLhttps://arxiv.org/abs/2208.06193

  10. [10]

    Diffusion Policy: Visuomotor Policy Learning via Action Diffusion

    Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion.CoRR, abs/2303.04137, 2023. doi: 10.48550/arXiv.2303.04137. URL https://arxiv.org/abs/ 2303.04137

  11. [11]

Policy Representation via Diffusion Probability Model for Reinforcement Learning

    Long Yang, Zhixiong Huang, Fenghao Lei, Yucun Zhong, Yiming Yang, Cong Fang, Shiting Wen, Binbin Zhou, and Zhouchen Lin. Policy representation via diffusion probability model for reinforcement learning, 2023. URLhttps://arxiv.org/abs/2305.13122

  12. [12]

    Flow q-learning

    Seohong Park, Qiyang Li, and Sergey Levine. Flow q-learning. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors,Forty-second International Conference on Machine Learning, ICML 2025, Vancouver, BC, Canada, July 13-19, 2025, Proceedings of Machine Learning Research. PMLR / O...

  13. [13]

    Posterior behavioral cloning: Pretraining bc policies for efficient rl finetuning,

    Andrew Wagenmaker, Perry Dong, Raymond Tsao, Chelsea Finn, and Sergey Levine. Posterior behavioral cloning: Pretraining bc policies for efficient rl finetuning, 2025. URL https: //arxiv.org/abs/2512.16911

  14. [14]

    Learning a diffusion model policy from rewards via q-score matching

    Michael Psenka, Alejandro Escontrela, Pieter Abbeel, and Yi Ma. Learning a diffusion model policy from rewards via q-score matching. InProceedings of the 41st International Conference on Machine Learning, 2024. 11

  15. [15]

    Steering your generalists: Improving robotic foundation models via value guidance.arXiv preprint arXiv:2410.13816, 2024

    Mitsuhiko Nakamoto, Oier Mees, Aviral Kumar, and Sergey Levine. Steering your generalists: Improving robotic foundation models via value guidance.CoRR, abs/2410.13816, 2024. doi: 10.48550/arXiv.2410.13816. URLhttps://arxiv.org/abs/2410.13816

  16. [16]

Robomonkey: Scaling test-time sampling and verification for vision-language-action models. arXiv preprint arXiv:2506.17811, 2025

    Jacky Kwok, Christopher Agia, Rohan Sinha, Matthew Foutter, Shulu Li, Ion Stoica, Azalia Mirhoseini, and Marco Pavone. Robomonkey: Scaling test-time sampling and verification for vision-language-action models.CoRR, abs/2506.17811, 2025. doi: 10.48550/arXiv.2506.17811. URLhttps://arxiv.org/abs/2506.17811

  17. [17]

    One-step diffusion policy: Fast visuomotor policies via diffusion distillation.arXiv preprint arXiv:2410.21257, 2024

    Zhendong Wang, Zhaoshuo Li, Ajay Mandlekar, Zhenjia Xu, Jiaojiao Fan, Yashraj Narang, Linxi Fan, Yuke Zhu, Yogesh Balaji, Mingyuan Zhou, Ming-Yu Liu, and Yu Zeng. One-step diffusion policy: Fast visuomotor policies via diffusion distillation.CoRR, abs/2410.21257,

  18. [18]
  19. [19]

    A ViLA: Asynchronous vision-language agent for streaming multimodal data interaction.arXiv preprint arXiv:2506.18472, 2025

    Andrew Wagenmaker, Mitsuhiko Nakamoto, Yunchu Zhang, Seohong Park, Waleed Yagoub, Anusha Nagabandi, Abhishek Gupta, and Sergey Levine. Steering your diffusion policy with latent space reinforcement learning.CoRR, abs/2506.15799, 2025. doi: 10.48550/arXiv.2506. 15799. URLhttps://arxiv.org/abs/2506.15799

  20. [20]

    arXiv preprint arXiv:2406.01970 (2024)

    Yuanhao Ban, Ruochen Wang, Tianyi Zhou, Boqing Gong, Cho-Jui Hsieh, and Minhao Cheng. The crystal ball hypothesis in diffusion models: Anticipating object positions from initial noise. CoRR, abs/2406.01970, 2024. doi: 10.48550/arXiv.2406.01970. URL https://arxiv.org/ abs/2406.01970

  21. [21]

    Not all noises are created equally: Diffusion noise selection and optimization.CoRR, abs/2407.14041, 2024

    Zipeng Qi, Lichen Bai, Haoyi Xiong, and Zeke Xie. Not all noises are created equally: Diffusion noise selection and optimization.CoRR, abs/2407.14041, 2024. doi: 10.48550/arXiv.2407. 14041. URLhttps://arxiv.org/abs/2407.14041

  22. [22]

InitNO: Boosting text-to-image diffusion models via initial noise optimization. arXiv preprint arXiv:2404.04650, 2024

    Xiefan Guo, Jinlin Liu, Miaomiao Cui, Jiankai Li, Hongyu Yang, and Di Huang. InitNO: Boosting text-to-image diffusion models via initial noise optimization.CoRR, abs/2404.04650,

  23. [23]
  24. [24]

    FIND: Fine-tuning initial noise distribution with policy optimization for diffusion models.CoRR, abs/2407.19453, 2024

    Changgu Chen, Libing Yang, Xiaoyan Yang, Lianggangxu Chen, Gaoqi He, Changbo Wang, and Yang Li. FIND: Fine-tuning initial noise distribution with policy optimization for diffusion models.CoRR, abs/2407.19453, 2024. doi: 10.48550/arXiv.2407.19453. URL https:// arxiv.org/abs/2407.19453

  25. [25]

Golden Noise for Diffusion Models: A Learning Framework

    Zikai Zhou, Shitong Shao, Lichen Bai, Zhiqiang Xu, Bo Han, and Zeke Xie. Golden noise for diffusion models: A learning framework.CoRR, abs/2411.09502, 2024. doi: 10.48550/arXiv. 2411.09502. URLhttps://arxiv.org/abs/2411.09502

  26. [26]

    A noise is worth diffusion guidance, 2024

    Donghoon Ahn, Jiwon Kang, Sanghyun Lee, Jaewon Min, Minjae Kim, Wooseok Jang, Hyoungwon Cho, Sayak Paul, SeonHwa Kim, Eunju Cha, Kyong Hwan Jin, and Seun- gryong Kim. A noise is worth diffusion guidance.CoRR, abs/2412.03895, 2024. doi: 10.48550/arXiv.2412.03895. URLhttps://arxiv.org/abs/2412.03895

  27. [27]

Noise-Level Diffusion Guidance: Well Begun Is Half Done

    Harvey Mannering, Zhiwu Huang, and Adam Prügel-Bennett. Noise-level diffusion guidance: Well begun is half done.CoRR, abs/2509.13936, 2025. doi: 10.48550/arXiv.2509.13936. URL https://arxiv.org/abs/2509.13936

  28. [28]

    Noise hypernetworks: Amortizing test-time compute in diffusion models.CoRR, abs/2508.09968,

    Luca Eyring, Shyamgopal Karthik, Alexey Dosovitskiy, Nataniel Ruiz, and Zeynep Akata. Noise hypernetworks: Amortizing test-time compute in diffusion models.CoRR, abs/2508.09968,

  29. [29]
  30. [30]

    TTSnap: Test-time scaling of diffusion models via noise-aware pruning.CoRR, abs/2511.22242, 2025

    Qingtao Yu, Changlin Song, Minghao Sun, Zhengyang Yu, Vinay Kumar Verma, Soumya Roy, Sumit Negi, Hongdong Li, and Dylan Campbell. TTSnap: Test-time scaling of diffusion models via noise-aware pruning.CoRR, abs/2511.22242, 2025. doi: 10.48550/arXiv.2511.22242. URL https://arxiv.org/abs/2511.22242

  31. [31]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow, 2022. URLhttps://arxiv.org/abs/2209.03003

  32. [32]

    One step diffusion via shortcut models.arXiv preprint arXiv:2410.12557, 2024

    Kevin Frans, Danijar Hafner, Sergey Levine, and Pieter Abbeel. One step diffusion via shortcut models, 2025. URLhttps://arxiv.org/abs/2410.12557. 12

  33. [33]

    $\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

    Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. π0.5: A Vision- Language-Action Model with Open-World Generalization.arXiv preprint arXiv:2504.16054, 2025

  34. [34]

    Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis, 2024. URL https://arxiv.org/abs/ 2403.03206

  35. [35]

    Seedream 3.0 Technical Report

    Yu Gao, Lixue Gong, Qiushan Guo, Xiaoxia Hou, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, Wei Liu, Yichun Shi, Shiqi Sun, Yu Tian, Zhi Tian, Peng Wang, Rui Wang, Xuanda Wang, Xun Wang, Ye Wang, Guofeng Wu, Jie Wu, Xin Xia, Xuefeng Xiao, Zhonghua Zhai, Xinyu Zhang, Qi Zhang, Yuwei Zhang, Shijia Zhao, Jianchao Yang, and Weilin Hu...

  36. [36]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, T...

  37. [37]

    Physical Intelligence, Ali Amin, Raichelle Aniceto, Ashwin Balakrishna, Kevin Black, Ken Conley, Grace Connors, James Darpinian, Karan Dhabalia, Jared DiCarlo, Danny Driess, Michael Equi, Adnan Esmail, Yunhao Fang, Chelsea Finn, Catherine Glossop, Thomas Godden, Ivan Goryachev, Lachy Groom, Hunter Hancock, Karol Hausman, Gashon Hussein, Brian Ichter, Szym...

  38. [38]

What matters for batch online reinforcement learning in robotics? arXiv preprint arXiv:2505.08078, 2025

    Perry Dong, Suvir Mirchandani, Dorsa Sadigh, and Chelsea Finn. What matters for batch online reinforcement learning in robotics?, 2025. URLhttps://arxiv.org/abs/2505.08078

  39. [39]

    Efficient online reinforcement learning with offline data

    Philip J Ball, Laura Smith, Ilya Kostrikov, and Sergey Levine. Efficient online reinforcement learning with offline data. InInternational Conference on Machine Learning, pages 1577–1594. PMLR, 2023

  40. [40]

    Q-learning with adjoint matching,

    Qiyang Li and Sergey Levine. Q-learning with adjoint matching, 2026. URL https://arxiv. org/abs/2601.14234

  41. [41]

Reinforcement Learning via Implicit Imitation Guidance

    Perry Dong, Alec M. Lessing, Annie S. Chen, and Chelsea Finn. Reinforcement learning via implicit imitation guidance, 2025. URLhttps://arxiv.org/abs/2506.07505

  42. [42]

    TQL: Scaling q-functions with transformers by preventing attention collapse.arXiv preprint arXiv:2602.01439, 2026

    Perry Dong, Kuo-Han Hung, Alexander Swerdlow, Dorsa Sadigh, and Chelsea Finn. Tql: Scaling q-functions with transformers by preventing attention collapse, 2026. URL https: //arxiv.org/abs/2602.01439

  43. [43]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilinsky. π0: A visi...

  44. [44]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    NVIDIA, :, Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi "Jim" Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, Joel Jang, Zhenyu Jiang, Jan Kautz, Kaushil Kundalia, Lawrence Lao, Zhiqi Li, Zongyu Lin, Kevin Lin, Guilin Liu, Edith Llontop, Loic Magne, Ajay Mandlekar, Avnish Narayan, Soroush Nasiriany, Scott Reed, Y...

  45. [45]

    World action models are zero-shot policies,

    Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, Ayaan Malik, Kyungmin Lee, William Liang, Nadun Ranawaka, Jiasheng Gu, Yinzhen Xu, Guanzhi Wang, Fengyuan Hu, Avnish Narayan, Johan Bjorck, Jing Wang, Gwanghyun Kim, Dantong Niu, Ruijie Zheng, Yuqi Xie, Jimmy Wu, Qi ...

  46. [46]

URL https://arxiv.org/abs/2602.15922

  47. [47]

    mimic-video: Video-action models for generalizable robot control beyond vlas.arXiv preprint arXiv:2512.15692,

    Jonas Pai, Liam Achenbach, Victoriano Montesinos, Benedek Forrai, Oier Mees, and Elvis Nava. mimic-video: Video-action models for generalizable robot control beyond vlas, 2025. URLhttps://arxiv.org/abs/2512.15692. 14 A Additional Experiments A natural question is whether the best-of- N sampling benefits observed in EXPO and IDQL can be recovered by a sing...