pith. machine review for the scientific record.

arxiv: 2604.19730 · v1 · submitted 2026-04-21 · 💻 cs.LG · cs.AI

Recognition: unknown

FASTER: Value-Guided Sampling for Fast RL

Alexander Swerdlow, Chelsea Finn, Dorsa Sadigh, Perry Dong

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 02:43 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords reinforcement learning · diffusion policies · value-guided sampling · test-time scaling · manipulation tasks · sampling efficiency

The pith

FASTER models denoising of action candidates as an MDP so a learned value function can filter poor samples early and cut compute in diffusion RL.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FASTER to capture the performance gains of sampling many actions in diffusion-based RL policies while avoiding the full cost of denoising every candidate to completion. It does this by recasting the entire process of generating and selecting among multiple partially denoised actions as a Markov Decision Process whose states live in the denoising space. A policy and value function trained inside that MDP then decide which candidates to keep and which to discard at each step, guided by predicted downstream returns. Sympathetic readers would care because current high-performing generative RL methods become impractical for robotics and other real-time uses once they rely on repeated full sampling at test time. If the approach works, it supplies a drop-in way to keep the benefits of test-time scaling without redesigning the underlying policy or training loop.
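To make the mechanism concrete, here is a minimal sketch of value-guided early filtering during denoising. It is an illustration under assumptions, not the authors' implementation: `denoise_step`, `q_dn`, the action dimensionality, and the single filtering point are stand-ins for the learned components described in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
ACTION_DIM = 7      # illustrative action dimensionality
NUM_STEPS = 10      # denoising steps T .. 1

def denoise_step(env_state, actions, t):
    # Stand-in for one reverse-diffusion step of the base policy.
    return actions + 0.1 * rng.standard_normal(actions.shape)

def q_dn(env_state, t, actions):
    # Stand-in for the learned denoise critic: predicted downstream return per candidate.
    return -np.linalg.norm(actions, axis=-1)

def value_guided_sampling(env_state, num_candidates=8, keep=1, filter_step=NUM_STEPS):
    """Score candidates early (here at the initial noise), keep the top `keep`,
    and finish denoising only the survivors instead of all N (best-of-N)."""
    actions = rng.standard_normal((num_candidates, ACTION_DIM))  # initial noise samples
    for t in range(NUM_STEPS, 0, -1):
        if t == filter_step:                       # early filtering decision
            values = q_dn(env_state, t, actions)
            survivors = np.argsort(values)[-keep:]
            actions = actions[survivors]
        actions = denoise_step(env_state, actions, t)
    return actions[0]                              # with keep=1, one fully denoised action remains

print(value_guided_sampling(env_state=None).shape)   # (7,)
```

The compute saving comes entirely from the fact that only the surviving candidates pass through the remaining denoising steps.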

Core claim

FASTER treats the denoising of multiple action candidates together with the selection of the best one as a Markov Decision Process defined directly in the space of partially denoised actions. A value function learned in this MDP predicts the eventual return of each candidate from its current denoising state and enables progressive filtering that discards low-value trajectories before they are fully denoised. The resulting lightweight module plugs into existing generative RL algorithms, improves policy performance on long-horizon manipulation tasks in both online and batch-online regimes, and matches the performance of a pretrained vision-language-action model while substantially lowering training and inference compute requirements.

What carries the argument

The denoising-space MDP that frames progressive filtering of action candidates according to their predicted downstream returns.
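As a reading aid, the state of that MDP (as defined in the paper's Figure 2) can be written down directly. This is a descriptive sketch with illustrative field names, not the authors' code.

```python
from dataclasses import dataclass
from typing import Dict, Set
import numpy as np

@dataclass
class FilteringState:
    """One state of the action-filtering MDP (paper, Figure 2):
    s_t = (s, t, C_t, {a_i^(t)} for i in C_t). Field names are illustrative."""
    env_state: np.ndarray                    # environment state s
    denoise_t: int                           # denoising timestep t in {T, ..., 1}
    survivors: Set[int]                      # surviving candidate set C_t ⊆ {1, ..., N}
    partial_actions: Dict[int, np.ndarray]   # partially denoised actions a_i^(t)

# An action of this MDP chooses which candidates to keep at step t; the learned
# value function scores such states by the return of the action eventually executed.
```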

If this is right

  • FASTER improves the underlying policies across challenging long-horizon manipulation tasks in both online and batch-online RL.
  • It achieves the best overall performance among the compared methods on those tasks.
  • When applied to a pretrained VLA it reaches the same final performance while reducing both training and inference compute.
  • The method can be inserted as a lightweight addition into existing generative RL algorithms without changing their training procedure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same early-filtering logic could be tested on other iterative generative processes used inside RL, such as autoregressive token models.
  • Dynamic adjustment of the candidate budget at each denoising step, rather than a fixed number, becomes feasible once value predictions are available (see the sketch after this list).
  • If the denoising-space value function transfers across tasks, it might reduce the need to retrain large policies from scratch when only inference efficiency is required.
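To illustrate the second point, one hypothetical budget rule (not proposed in the paper) would keep every candidate whose predicted value sits within a margin of the current best, so that confident value gaps prune aggressively while flat value landscapes keep more samples.

```python
import numpy as np

def dynamic_keep_count(predicted_values, min_keep=1, max_keep=8, margin=0.1):
    # Keep every candidate whose predicted value is within `margin` of the best,
    # clipped to [min_keep, max_keep]. Purely illustrative; the paper uses a fixed budget.
    best = predicted_values.max()
    keep = int((predicted_values >= best - margin).sum())
    return int(np.clip(keep, min_keep, max_keep))

print(dynamic_keep_count(np.array([0.9, 0.88, 0.3, 0.1])))  # -> 2
```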

Load-bearing premise

A value function trained inside the denoising-space MDP will rank partially denoised action candidates by their true eventual return without systematic bias introduced by early filtering decisions or by distribution shift between training and test-time trajectories.
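One way to probe this premise empirically is to compare the rank correlation between the critic's predicted values and the returns realized after full denoising and execution, computed separately on unfiltered rollouts (as seen in training) and on rollouts produced under the test-time filter. A minimal sketch, assuming such pairs have been logged (all arrays below are placeholders, not data from the paper):

```python
import numpy as np
from scipy.stats import spearmanr

def value_rank_fidelity(predicted_values, realized_returns):
    # Rank correlation between the denoise critic's predictions and the returns
    # actually obtained after full denoising and execution.
    rho, p = spearmanr(predicted_values, realized_returns)
    return rho, p

# Placeholder arrays; one set would come from unfiltered rollouts, one from filtered rollouts.
rho_unfiltered, _ = value_rank_fidelity(np.random.rand(500), np.random.rand(500))
rho_filtered, _ = value_rank_fidelity(np.random.rand(500), np.random.rand(500))
print(rho_unfiltered, rho_filtered)   # a large drop on the filtered set would signal bias
```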

What would settle it

An experiment in which FASTER is run on the same tasks and the early-filtered trajectories produce lower returns than fully denoising all candidates, or yield no net reduction in compute for equivalent final performance.

Figures

Figures reproduced from arXiv: 2604.19730 by Alexander Swerdlow, Chelsea Finn, Dorsa Sadigh, Perry Dong.

Figure 1
Figure 1. Left: Overview of FASTER. Instead of denoising all N candidates and selecting the best action post-hoc (best-of-N), FASTER learns a denoise critic Qdn that scores action samples during denoising, often directly on the initial noise itself. Right: Performance of FASTER compared with baseline averaged across tasks for π0.5. The FLOPs are normalized on the x-axis against the amount of FLOPs needed for the bas… view at source ↗
Figure 2
Figure 2. Action Filtering MDP. We model the process of denoising action candidates and selecting the best one as an MDP where the goal is to filter action samples during denoising while maximizing returns. States. A state s_t = (s, t, C_t, {a_i^(t)}_{i∈C_t}) consists of the environment state s, the denoising timestep t ∈ {T, …, 1}, the surviving candidate set C_t ⊆ {1, …, N}, and the partially denoised interm… view at source ↗
Figure 3
Figure 3. Top: Success rates of FASTER and baselines in the online settings. FASTER-EXPO outperforms strong baselines in sample efficiency. Bottom: Compute comparisons of FASTER-EXPO and EXPO. FASTER eliminates extra denoising during training and inference, yielding large FLOP reductions relative to EXPO with comparable task performance. view at source ↗
Figure 4
Figure 4. Success rate and compute comparisons of FASTER-IDQL and IDQL in the online setting. FASTER can be applied to IDQL to eliminate extra denoising rollouts at inference while obtaining the same performance in success rates. view at source ↗
Figure 5
Figure 5. Top: Success rate curves of FASTER-EXPO and EXPO in the batch-online setting. FASTER matches the performance of EXPO in iterations. Bottom: Compute comparisons of FASTER-EXPO and EXPO in the batch-online setting. As in the online setting, FASTER-EXPO yields a large FLOP reduction compared to EXPO from not needing to denoise all action samples. view at source ↗
Figure 6
Figure 6. Training and inference timing of FASTER-EXPO compared to EXPO. FASTER-EXPO achieves 1.7x improvement in inference time and 4.5x improvement in the update step time. view at source ↗
Figure 7
Figure 7. FASTER-EXPO compared to EXPO on top of π0.5. Top: Performance with environment steps. FASTER-EXPO is competitive in performance compared to EXPO. Bottom: Performance with FLOPs. FASTER-EXPO performs significantly better than EXPO under the same compute as FASTER-EXPO chooses the best action sample without denoising all sampled actions in inference and training. view at source ↗
Figure 8
Figure 8. Critic-size ablation for FASTER-EXPO on can and square. We compare filtering critics Qdn with parameter counts set to approximately 1.0×, 0.5×, and 0.25× that of Qa. We find that Qdn can be substantially smaller than Qa without degrading performance. view at source ↗
Figure 10
Figure 10. Learned-filter ablation results. Filtering at the initial seed performs comparably to learning the full filtering policy in the MDP. view at source ↗
Figure 11
Figure 11. Distillation ablation results. Distilling the value-maximizing distribution into a policy performs significantly worse compared to FASTER. view at source ↗
read the original abstract

Some of the most performant reinforcement learning algorithms today can be prohibitively expensive as they use test-time scaling methods such as sampling multiple action candidates and selecting the best one. In this work, we propose FASTER, a method for getting the benefits of sampling-based test-time scaling of diffusion-based policies without the computational cost by tracing the performance gain of action samples back to earlier in the denoising process. Our key insight is that we can model the denoising of multiple action candidates and selecting the best one as a Markov Decision Process (MDP) where the goal is to progressively filter action candidates before denoising is complete. With this MDP, we can learn a policy and value function in the denoising space that predicts the downstream value of action candidates in the denoising process and filters them while maximizing returns. The result is a method that is lightweight and can be plugged into existing generative RL algorithms. Across challenging long-horizon manipulation tasks in online and batch-online RL, FASTER consistently improves the underlying policies and achieves the best overall performance among the compared methods. Applied to a pretrained VLA, FASTER achieves the same performance while substantially reducing training and inference compute requirements. Code is available at https://github.com/alexanderswerdlow/faster .

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces FASTER, a method that reformulates the denoising process of multiple action candidates from a diffusion-based policy as a Markov Decision Process (MDP) in denoising space. A policy and value function are learned to progressively filter low-value candidates early in the denoising trajectory, with the value function predicting downstream returns to maximize performance while reducing the number of full denoising steps required. The approach is presented as a lightweight plug-in for existing generative RL algorithms. Empirical claims include consistent policy improvements and best-in-class performance on long-horizon manipulation tasks in both online and batch-online RL settings, plus equivalent performance to a pretrained VLA with substantially lower training and inference compute.

Significance. If the empirical claims hold under the distribution-shift concerns, FASTER would provide a practical mechanism for test-time scaling in diffusion policies without proportional compute cost, which is relevant for robotics and long-horizon control. Code availability supports reproducibility and potential follow-up work.

major comments (2)
  1. [Method (MDP formulation and value-function training)] The central claim rests on the value function accurately ranking partially denoised candidates by eventual return. However, because the filtering policy alters the distribution of denoising trajectories at test time relative to the (unfiltered or differently filtered) trajectories used to train the value function, systematic bias in value estimates is possible. This distribution-shift issue is load-bearing for the filtering decisions and is not addressed by additional analysis or targeted ablations in the experiments section.
  2. [Experiments] Table 1 and Figure 3 (performance comparisons): while the paper states that FASTER achieves the best overall performance, the reported gains are not accompanied by statistical significance tests or sufficient error analysis across seeds to confirm that improvements are not attributable to variance in the underlying base policies.
minor comments (2)
  1. [Abstract] The abstract would benefit from one or two key quantitative results (e.g., success-rate deltas or compute-reduction factors) to allow readers to assess the magnitude of the claimed improvements without reading the full experiments section.
  2. [Background / Method] Notation for the denoising-space state and action spaces is introduced without an explicit comparison table to the original MDP; a small diagram or table would improve clarity for readers unfamiliar with diffusion policies.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the two major comments point by point below, providing clarifications and indicating the revisions we will make to strengthen the paper.

read point-by-point responses
  1. Referee: [Method (MDP formulation and value-function training)] The central claim rests on the value function accurately ranking partially denoised candidates by eventual return. However, because the filtering policy alters the distribution of denoising trajectories at test time relative to the (unfiltered or differently filtered) trajectories used to train the value function, systematic bias in value estimates is possible. This distribution-shift issue is load-bearing for the filtering decisions and is not addressed by additional analysis or targeted ablations in the experiments section.

    Authors: We appreciate the referee highlighting this potential distribution shift. The value function is trained exclusively on trajectories sampled from the base diffusion policy (without filtering), which exposes it to a wide range of denoising paths and their associated returns. At test time the filtering policy uses these estimates to prune low-value candidates early; because the policy is trained to maximize the same return objective, the selected trajectories are biased toward high-value paths by design. Nevertheless, we agree that an explicit analysis of any resulting bias would improve the manuscript. In the revision we will add an ablation that measures the correlation between predicted values and realized returns on both filtered and unfiltered trajectory sets, together with a short discussion of observed discrepancies. These results will be placed in the experiments section. revision: partial

  2. Referee: [Experiments] Table 1 and Figure 3 (performance comparisons): while the paper states that FASTER achieves the best overall performance, the reported gains are not accompanied by statistical significance tests or sufficient error analysis across seeds to confirm that improvements are not attributable to variance in the underlying base policies.

    Authors: We agree that the current presentation would benefit from formal statistical analysis. While the reported numbers are already averages over multiple random seeds, we did not include error bars or significance tests. In the revised manuscript we will augment Table 1 and Figure 3 with standard-deviation error bars across seeds and add p-values from paired statistical tests (e.g., Wilcoxon signed-rank) comparing FASTER against each baseline. These additions will be described in the experimental protocol subsection. revision: yes
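A minimal sketch of the proposed paired test, assuming per-seed success rates are available for FASTER and a baseline (the numbers below are placeholders, not results from the paper):

```python
import numpy as np
from scipy.stats import wilcoxon

# Placeholder per-seed success rates (one entry per random seed).
faster_success   = np.array([0.82, 0.79, 0.85, 0.80, 0.83])
baseline_success = np.array([0.74, 0.77, 0.76, 0.72, 0.78])

# Paired test across seeds, as proposed in the rebuttal; a small p-value suggests the
# improvement is not explained by seed-to-seed variance alone.
stat, p_value = wilcoxon(faster_success, baseline_success)
print(f"Wilcoxon statistic={stat:.3f}, p={p_value:.4f}")
```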

Circularity Check

0 steps flagged

No circularity: standard RL value learning on explicitly defined denoising MDP

full rationale

The paper models denoising + selection as an MDP, then trains a value function on trajectories from that MDP to estimate downstream return for early filtering. This is a conventional critic-learning step whose training signal comes from environment returns, not from the filtering policy itself. No equation reduces the learned value to a tautology or renames a fitted parameter as a prediction; no self-citation supplies a uniqueness theorem or ansatz; the empirical claims rest on task performance measured after training, which is externally falsifiable. The derivation chain therefore remains self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method rests on one core modeling assumption and a learned component; no new physical entities are introduced.

free parameters (1)
  • Value function and policy parameters in denoising space
    These are fitted during training to predict downstream returns and are central to the filtering decisions.
axioms (1)
  • domain assumption The process of denoising multiple action candidates can be faithfully represented as an MDP whose states are partial trajectories and whose actions are keep/discard decisions.
    This modeling choice is stated directly in the abstract as the key insight enabling value-guided early stopping.

pith-pipeline@v0.9.0 · 5519 in / 1427 out tokens · 48276 ms · 2026-05-10T02:43:04.171380+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

47 extracted references · 43 canonical work pages · 13 internal anchors

  1. [1]

    Self-Consistency Improves Chain of Thought Reasoning in Language Models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V . Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models.CoRR, abs/2203.11171, 2022. doi: 10.48550/arXiv.2203.11171. URL https:// arxiv.org/abs/2203.11171

  2. [2]

    Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

    Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM test-time compute optimally can be more effective than scaling model parameters.CoRR, abs/2408.03314, 2024. doi: 10.48550/arXiv.2408.03314. URLhttps://arxiv.org/abs/2408.03314

  3. [3]

    Inference-aware fine-tuning for best-of-n sampling in large language models.arXiv preprint arXiv:2412.15287, 2024

    Yinlam Chow, Guy Tennenholtz, Izzeddin Gur, Vincent Zhuang, Bo Dai, Sridhar Thiagarajan, Craig Boutilier, Rishabh Agarwal, Aviral Kumar, and Aleksandra Faust. Inference-aware fine-tuning for best-of-N sampling in large language models.CoRR, abs/2412.15287, 2024. doi: 10.48550/arXiv.2412.15287. URLhttps://arxiv.org/abs/2412.15287

  4. [4]

    Inference-time scaling for diffusion models beyond scaling denoising steps

    Nanye Ma, Shangyuan Tong, Haolin Jia, Hexiang Hu, Yu-Chuan Su, Mingda Zhang, Xuan Yang, Yandong Li, Tommi Jaakkola, Xuhui Jia, and Saining Xie. Inference-time scaling for diffusion models beyond scaling denoising steps.CoRR, abs/2501.09732, 2025. doi: 10.48550/arXiv.2501.09732. URLhttps://arxiv.org/abs/2501.09732

  5. [5]

    EXPO: Stable Reinforcement Learning with Expressive Policies

    Perry Dong, Qiyang Li, Dorsa Sadigh, and Chelsea Finn. EXPO: Stable reinforcement learning with expressive policies.CoRR, abs/2507.07986, 2025. doi: 10.48550/arXiv.2507.07986. URL https://arxiv.org/abs/2507.07986

  6. [6]

    IDQL: Implicit Q-Learning as an Actor-Critic Method with Diffusion Policies

    Philippe Hansen-Estruch, Ilya Kostrikov, Michael Janner, Jakub Grudzien Kuba, and Sergey Levine. IDQL: Implicit q-learning as an actor-critic method with diffusion policies.CoRR, abs/2304.10573, 2023. doi: 10.48550/arXiv.2304.10573. URL https://arxiv.org/abs/ 2304.10573

  7. [7]

Value Flows

    Perry Dong, Chongyi Zheng, Chelsea Finn, Dorsa Sadigh, and Benjamin Eysenbach. Value flows, 2026. URLhttps://arxiv.org/abs/2510.07650

  8. [8]

    Offline reinforcement learning via high-fidelity generative behavior modeling

    Huayu Chen, Cheng Lu, Chengyang Ying, Hang Su, and Jun Zhu. Offline reinforcement learning via high-fidelity generative behavior modeling, 2023. URL https://arxiv.org/ abs/2209.14548

  9. [9]

    Diffusion policies as an expressive policy class for offline reinforcement learning.arXiv preprint arXiv:2208.06193,

    Zhendong Wang, Jonathan J. Hunt, and Mingyuan Zhou. Diffusion policies as an expressive policy class for offline reinforcement learning.CoRR, abs/2208.06193, 2022. doi: 10.48550/ arXiv.2208.06193. URLhttps://arxiv.org/abs/2208.06193

  10. [10]

    Diffusion Policy: Visuomotor Policy Learning via Action Diffusion

    Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion.CoRR, abs/2303.04137, 2023. doi: 10.48550/arXiv.2303.04137. URL https://arxiv.org/abs/ 2303.04137

  11. [11]

Policy Representation via Diffusion Probability Model for Reinforcement Learning

    Long Yang, Zhixiong Huang, Fenghao Lei, Yucun Zhong, Yiming Yang, Cong Fang, Shiting Wen, Binbin Zhou, and Zhouchen Lin. Policy representation via diffusion probability model for reinforcement learning, 2023. URLhttps://arxiv.org/abs/2305.13122

  12. [12]

    Flow q-learning

    Seohong Park, Qiyang Li, and Sergey Levine. Flow q-learning. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors,Forty-second International Conference on Machine Learning, ICML 2025, Vancouver, BC, Canada, July 13-19, 2025, Proceedings of Machine Learning Research. PMLR / O...

  13. [13]

    Posterior behavioral cloning: Pretraining bc policies for efficient rl finetuning,

    Andrew Wagenmaker, Perry Dong, Raymond Tsao, Chelsea Finn, and Sergey Levine. Posterior behavioral cloning: Pretraining bc policies for efficient rl finetuning, 2025. URL https: //arxiv.org/abs/2512.16911

  14. [14]

    Learning a diffusion model policy from rewards via q-score matching

    Michael Psenka, Alejandro Escontrela, Pieter Abbeel, and Yi Ma. Learning a diffusion model policy from rewards via q-score matching. InProceedings of the 41st International Conference on Machine Learning, 2024. 11

  15. [15]

    Steering your generalists: Improving robotic foundation models via value guidance.arXiv preprint arXiv:2410.13816, 2024

    Mitsuhiko Nakamoto, Oier Mees, Aviral Kumar, and Sergey Levine. Steering your generalists: Improving robotic foundation models via value guidance.CoRR, abs/2410.13816, 2024. doi: 10.48550/arXiv.2410.13816. URLhttps://arxiv.org/abs/2410.13816

  16. [16]

Robomonkey: Scaling test-time sampling and verification for vision-language-action models. arXiv preprint arXiv:2506.17811, 2025

    Jacky Kwok, Christopher Agia, Rohan Sinha, Matthew Foutter, Shulu Li, Ion Stoica, Azalia Mirhoseini, and Marco Pavone. Robomonkey: Scaling test-time sampling and verification for vision-language-action models.CoRR, abs/2506.17811, 2025. doi: 10.48550/arXiv.2506.17811. URLhttps://arxiv.org/abs/2506.17811

  17. [17]

    One-step diffusion policy: Fast visuomotor policies via diffusion distillation.arXiv preprint arXiv:2410.21257, 2024

    Zhendong Wang, Zhaoshuo Li, Ajay Mandlekar, Zhenjia Xu, Jiaojiao Fan, Yashraj Narang, Linxi Fan, Yuke Zhu, Yogesh Balaji, Mingyuan Zhou, Ming-Yu Liu, and Yu Zeng. One-step diffusion policy: Fast visuomotor policies via diffusion distillation.CoRR, abs/2410.21257,

  18. [18]
  19. [19]

    A ViLA: Asynchronous vision-language agent for streaming multimodal data interaction.arXiv preprint arXiv:2506.18472, 2025

    Andrew Wagenmaker, Mitsuhiko Nakamoto, Yunchu Zhang, Seohong Park, Waleed Yagoub, Anusha Nagabandi, Abhishek Gupta, and Sergey Levine. Steering your diffusion policy with latent space reinforcement learning.CoRR, abs/2506.15799, 2025. doi: 10.48550/arXiv.2506. 15799. URLhttps://arxiv.org/abs/2506.15799

  20. [20]

    arXiv preprint arXiv:2406.01970 (2024)

    Yuanhao Ban, Ruochen Wang, Tianyi Zhou, Boqing Gong, Cho-Jui Hsieh, and Minhao Cheng. The crystal ball hypothesis in diffusion models: Anticipating object positions from initial noise. CoRR, abs/2406.01970, 2024. doi: 10.48550/arXiv.2406.01970. URL https://arxiv.org/ abs/2406.01970

  21. [21]

    Not all noises are created equally: Diffusion noise selection and optimization.CoRR, abs/2407.14041, 2024

    Zipeng Qi, Lichen Bai, Haoyi Xiong, and Zeke Xie. Not all noises are created equally: Diffusion noise selection and optimization.CoRR, abs/2407.14041, 2024. doi: 10.48550/arXiv.2407. 14041. URLhttps://arxiv.org/abs/2407.14041

  22. [22]

InitNO: Boosting text-to-image diffusion models via initial noise optimization. arXiv preprint arXiv:2404.04650, 2024

    Xiefan Guo, Jinlin Liu, Miaomiao Cui, Jiankai Li, Hongyu Yang, and Di Huang. InitNO: Boosting text-to-image diffusion models via initial noise optimization.CoRR, abs/2404.04650,

  23. [23]
  24. [24]

    FIND: Fine-tuning initial noise distribution with policy optimization for diffusion models.CoRR, abs/2407.19453, 2024

    Changgu Chen, Libing Yang, Xiaoyan Yang, Lianggangxu Chen, Gaoqi He, Changbo Wang, and Yang Li. FIND: Fine-tuning initial noise distribution with policy optimization for diffusion models.CoRR, abs/2407.19453, 2024. doi: 10.48550/arXiv.2407.19453. URL https:// arxiv.org/abs/2407.19453

  25. [25]

Golden Noise for Diffusion Models: A Learning Framework

    Zikai Zhou, Shitong Shao, Lichen Bai, Zhiqiang Xu, Bo Han, and Zeke Xie. Golden noise for diffusion models: A learning framework.CoRR, abs/2411.09502, 2024. doi: 10.48550/arXiv. 2411.09502. URLhttps://arxiv.org/abs/2411.09502

  26. [26]

    A noise is worth diffusion guidance, 2024

    Donghoon Ahn, Jiwon Kang, Sanghyun Lee, Jaewon Min, Minjae Kim, Wooseok Jang, Hyoungwon Cho, Sayak Paul, SeonHwa Kim, Eunju Cha, Kyong Hwan Jin, and Seun- gryong Kim. A noise is worth diffusion guidance.CoRR, abs/2412.03895, 2024. doi: 10.48550/arXiv.2412.03895. URLhttps://arxiv.org/abs/2412.03895

  27. [27]

Noise-Level Diffusion Guidance: Well Begun Is Half Done

    Harvey Mannering, Zhiwu Huang, and Adam Prügel-Bennett. Noise-level diffusion guidance: Well begun is half done.CoRR, abs/2509.13936, 2025. doi: 10.48550/arXiv.2509.13936. URL https://arxiv.org/abs/2509.13936

  28. [28]

    Noise hypernetworks: Amortizing test-time compute in diffusion models.CoRR, abs/2508.09968,

    Luca Eyring, Shyamgopal Karthik, Alexey Dosovitskiy, Nataniel Ruiz, and Zeynep Akata. Noise hypernetworks: Amortizing test-time compute in diffusion models.CoRR, abs/2508.09968,

  29. [29]
  30. [30]

    TTSnap: Test-time scaling of diffusion models via noise-aware pruning.CoRR, abs/2511.22242, 2025

    Qingtao Yu, Changlin Song, Minghao Sun, Zhengyang Yu, Vinay Kumar Verma, Soumya Roy, Sumit Negi, Hongdong Li, and Dylan Campbell. TTSnap: Test-time scaling of diffusion models via noise-aware pruning.CoRR, abs/2511.22242, 2025. doi: 10.48550/arXiv.2511.22242. URL https://arxiv.org/abs/2511.22242

  31. [31]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow, 2022. URLhttps://arxiv.org/abs/2209.03003

  32. [32]

    One step diffusion via shortcut models.arXiv preprint arXiv:2410.12557, 2024

    Kevin Frans, Danijar Hafner, Sergey Levine, and Pieter Abbeel. One step diffusion via shortcut models, 2025. URLhttps://arxiv.org/abs/2410.12557. 12

  33. [33]

    $\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

    Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. π0.5: A Vision- Language-Action Model with Open-World Generalization.arXiv preprint arXiv:2504.16054, 2025

  34. [34]

    Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis, 2024. URL https://arxiv.org/abs/ 2403.03206

  35. [35]

    Seedream 3.0 Technical Report

    Yu Gao, Lixue Gong, Qiushan Guo, Xiaoxia Hou, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, Wei Liu, Yichun Shi, Shiqi Sun, Yu Tian, Zhi Tian, Peng Wang, Rui Wang, Xuanda Wang, Xun Wang, Ye Wang, Guofeng Wu, Jie Wu, Xin Xia, Xuefeng Xiao, Zhonghua Zhai, Xinyu Zhang, Qi Zhang, Yuwei Zhang, Shijia Zhao, Jianchao Yang, and Weilin Hu...

  36. [36]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, T...

  37. [37]

    Physical Intelligence, Ali Amin, Raichelle Aniceto, Ashwin Balakrishna, Kevin Black, Ken Conley, Grace Connors, James Darpinian, Karan Dhabalia, Jared DiCarlo, Danny Driess, Michael Equi, Adnan Esmail, Yunhao Fang, Chelsea Finn, Catherine Glossop, Thomas Godden, Ivan Goryachev, Lachy Groom, Hunter Hancock, Karol Hausman, Gashon Hussein, Brian Ichter, Szym...

  38. [38]

What matters for batch online reinforcement learning in robotics? arXiv preprint arXiv:2505.08078, 2025

    Perry Dong, Suvir Mirchandani, Dorsa Sadigh, and Chelsea Finn. What matters for batch online reinforcement learning in robotics?, 2025. URLhttps://arxiv.org/abs/2505.08078

  39. [39]

    Efficient online reinforcement learning with offline data

    Philip J Ball, Laura Smith, Ilya Kostrikov, and Sergey Levine. Efficient online reinforcement learning with offline data. InInternational Conference on Machine Learning, pages 1577–1594. PMLR, 2023

  40. [40]

    Q-learning with adjoint matching,

    Qiyang Li and Sergey Levine. Q-learning with adjoint matching, 2026. URL https://arxiv. org/abs/2601.14234

  41. [41]

Reinforcement Learning via Implicit Imitation Guidance

    Perry Dong, Alec M. Lessing, Annie S. Chen, and Chelsea Finn. Reinforcement learning via implicit imitation guidance, 2025. URLhttps://arxiv.org/abs/2506.07505

  42. [42]

    TQL: Scaling q-functions with transformers by preventing attention collapse.arXiv preprint arXiv:2602.01439, 2026

    Perry Dong, Kuo-Han Hung, Alexander Swerdlow, Dorsa Sadigh, and Chelsea Finn. Tql: Scaling q-functions with transformers by preventing attention collapse, 2026. URL https: //arxiv.org/abs/2602.01439

  43. [43]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilinsky. π0: A visi...

  44. [44]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    NVIDIA, :, Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi "Jim" Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, Joel Jang, Zhenyu Jiang, Jan Kautz, Kaushil Kundalia, Lawrence Lao, Zhiqi Li, Zongyu Lin, Kevin Lin, Guilin Liu, Edith Llontop, Loic Magne, Ajay Mandlekar, Avnish Narayan, Soroush Nasiriany, Scott Reed, Y...

  45. [45]

    World action models are zero-shot policies,

    Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, Ayaan Malik, Kyungmin Lee, William Liang, Nadun Ranawaka, Jiasheng Gu, Yinzhen Xu, Guanzhi Wang, Fengyuan Hu, Avnish Narayan, Johan Bjorck, Jing Wang, Gwanghyun Kim, Dantong Niu, Ruijie Zheng, Yuqi Xie, Jimmy Wu, Qi ...

  46. [46]

URL https://arxiv.org/abs/2602.15922

  47. [47]

    mimic-video: Video-action models for generalizable robot control beyond vlas.arXiv preprint arXiv:2512.15692,

    Jonas Pai, Liam Achenbach, Victoriano Montesinos, Benedek Forrai, Oier Mees, and Elvis Nava. mimic-video: Video-action models for generalizable robot control beyond vlas, 2025. URLhttps://arxiv.org/abs/2512.15692. 14 A Additional Experiments A natural question is whether the best-of- N sampling benefits observed in EXPO and IDQL can be recovered by a sing...