Precise: SDE-Consistent Stochastic Sampling for RL Post-Training of Flow-Matching Models
Pith reviewed 2026-05-25 05:03 UTC · model grok-4.3
The pith
Precise maintains SDE consistency in stochastic sampling for flow-matching models by freezing the clean-latent posterior mean, enabling faster RL post-training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Precise is a stochastic sampler that uses an SDE schedule to balance effective exploration with stability. It keeps the denoising trajectory consistent with the underlying flow-matching SDE through a novel approximation that freezes the clean-latent posterior mean, which resolves the excess discretization noise in standard samplers.
What carries the argument
An SDE-consistent stochastic sampler built on a frozen clean-latent posterior-mean approximation, which keeps the discretized trajectory from deviating from the flow-matching process at small step counts.
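To make the mechanism concrete, here is a minimal NumPy sketch of a single transition, transcribed from the formula quoted further down this page under Theorem 4.3. It is an illustration under stated assumptions, not the authors' implementation: the function name `frozen_mean_step` is hypothetical, `A` stands for the value of A(t′, t), whose definition is not given on this page and is therefore passed in as a plain non-negative number, and `w` is assumed to be standard normal noise.

```python
import numpy as np

def frozen_mean_step(z_t, z0_hat, t, t_next, A, w):
    """One transition z_t -> z_{t'} with the clean-latent estimate held fixed.

    A is the value of A(t', t) in the quoted formula; its definition is not
    given on this page, so it is treated as a caller-supplied number >= 0.
    """
    decay = np.exp(-A / 2.0)                           # shrinks the old noise component
    noise_scale = t_next * np.sqrt(1.0 - np.exp(-A))   # scale of the fresh noise w
    residual = z_t - (1.0 - t) * z0_hat                # t times the implied noise
    return (1.0 - t_next) * z0_hat + (t_next / t) * decay * residual + noise_scale * w

# Usage on a two-dimensional toy latent (values are arbitrary).
rng = np.random.default_rng(0)
z0_hat = np.array([1.0, -2.0])
eps = rng.standard_normal(2)
t, t_next = 0.8, 0.6
z_t = (1.0 - t) * z0_hat + t * eps
w = rng.standard_normal(2)

# With A = 0 no fresh noise enters and the step is deterministic.
z_det = frozen_mean_step(z_t, z0_hat, t, t_next, A=0.0, w=w)
```

In this form the squared weights on the old noise and the fresh noise, exp(-A) and 1 - exp(-A), sum to one, so the noise component keeps scale t′ regardless of step size. Setting A to zero recovers a deterministic step toward the frozen estimate.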
If this is right
- RL post-training converges faster and with greater stability when the sampler maintains SDE consistency.
- Alignment metrics such as PickScore and HPSv2.1 reach state-of-the-art levels under the new sampler.
- Wall-clock training time needed to match the best in-domain performance of prior samplers drops by 13.1 to 53.2 percent.
Where Pith is reading between the lines
- The same freezing approximation might reduce discretization artifacts in non-RL sampling tasks for flow-matching or diffusion models.
- The approach could be tested on video or 3D generation pipelines where step count is similarly constrained.
- If the posterior-mean freeze generalizes, it may simplify SDE schedule design across different generative architectures.
Load-bearing premise
The discretization deviation observed in the toy example is the dominant reason for instability and slow convergence when standard samplers are used on full-scale flow-matching models in RL.
What would settle it
Measure the actual discretization error magnitude of standard samplers versus Precise at the exact step counts, noise levels, and latent resolutions used in the paper's RL experiments to see whether the error gap accounts for the reported stability and speed differences.
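A toy version of that measurement can be sketched in a few lines. The example below is not taken from the paper: it assumes 1-D zero-mean Gaussian data, the interpolation z_t = (1 - t) z_0 + t ε, the exact posterior mean standing in for the model, and a hypothetical schedule that assigns A(t′, t) = `a_total / n_steps` on every step. It applies the transition quoted under Theorem 4.3 on this page and tracks the sample variance in closed form, so the gap to the data variance can be read off at any step count. The names `final_variance` and `variance_error` are illustrative.

```python
import math

def final_variance(n_steps, data_std=1.0, a_total=0.0):
    """Variance of the sampler output after n_steps uniform steps from t=1 to t=0.

    Toy assumptions: zero-mean Gaussian data with std data_std, exact posterior
    mean as the 'model', and a hypothetical even split of a_total across steps.
    """
    s2 = data_std ** 2
    var = 1.0                        # z_1 is pure standard normal noise
    a = a_total / n_steps
    for i in range(n_steps):
        t = 1.0 - i / n_steps
        t_next = 1.0 - (i + 1) / n_steps
        # For Gaussian data the exact posterior mean is linear: z0_hat = k * z_t.
        k = (1.0 - t) * s2 / ((1.0 - t) ** 2 * s2 + t ** 2)
        # Coefficient multiplying z_t in the frozen-mean transition.
        c = (1.0 - t_next) * k + (t_next / t) * math.exp(-a / 2.0) * (1.0 - (1.0 - t) * k)
        # Propagate the variance and add the fresh-noise contribution.
        var = c ** 2 * var + t_next ** 2 * (1.0 - math.exp(-a))
    return var

def variance_error(n_steps, data_std=1.0, a_total=0.0):
    """Absolute gap between the sampler's output variance and the data variance."""
    return abs(final_variance(n_steps, data_std, a_total) - data_std ** 2)

for n in (1, 2, 4, 64):
    print(n, variance_error(n))
```

In this toy the gap shrinks as the step count grows in the deterministic case (`a_total=0`), and it vanishes at every step count when the data collapse to a point, because the frozen estimate is then exact. Whether the same ordering holds at the paper's noise levels and resolutions is exactly what the measurement above would need to establish.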
Original abstract
Reinforcement learning (RL) has become an effective way to improve prompt alignment and perceptual quality in diffusion and flow-matching generators. A critical step for applying online RL to flow matching is turning the deterministic sampling trajectory into a stochastic policy, typically by replacing the reverse-time Ordinary Differential Equation (ODE) with a Stochastic Differential Equation (SDE). The stochastic sampler, controlling the exploration behavior and denoising dynamics, is thus part of the policy, and its design can significantly affect the reward optimization performance. We break down the sampler design into two interdependent components: choosing the right amount of stochastic exploration, and discretizing the resulting SDE faithfully at the small step counts used in RL. To address the first component, we analyze the inherent tension between exploration and stability in denoising and derive an SDE schedule that balances the two. Turning to the discretization challenge, we use a toy example to show that existing samplers can deviate from the flow-matching process, either by introducing excessive discretization noise or by relying on heuristic rules that do not guarantee convergence to the data distribution. To address these issues, we propose Precise, a new stochastic sampler that balances effective exploration with stability. Crucially, Precise keeps the denoising trajectory SDE-consistent through a novel approximation that freezes the clean-latent posterior mean, resolving the excess noise issue in standard samplers. Extensive experiments demonstrate that this formulation leads to significantly faster and more stable reward optimization via reinforcement learning, achieving state-of-the-art alignment scores (e.g., PickScore, HPSv2.1) while requiring 13.1-53.2% less wall-clock training time to match the best in-domain performance of prior samplers.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that existing stochastic samplers for flow-matching models in RL post-training introduce excess discretization noise or rely on non-convergent heuristics. It derives an SDE schedule balancing exploration and stability, diagnoses the issue via a toy example, and introduces Precise, which maintains SDE-consistency via a novel approximation that freezes the clean-latent posterior mean. This is asserted to yield significantly faster and more stable reward optimization, SOTA alignment scores (PickScore, HPSv2.1), and 13.1-53.2% less wall-clock training time.
Significance. If the central empirical claims hold and the toy-example diagnosis generalizes, the work would supply a principled, SDE-consistent stochastic policy for RL fine-tuning of flow-matching generators. The explicit separation of schedule design from discretization fidelity, together with the posterior-mean freezing approximation, could become a reusable component for stable online RL in this model class.
major comments (3)
- [Toy example / §3] Toy-example diagnosis (abstract and §3): the claim that discretization deviation is the primary bottleneck is supported only by the toy case; no evidence is given that this error dominates over SDE-schedule choice, reward-model variance, or policy-gradient noise once step counts and resolutions reach those used in the reported RL experiments.
- [Experiments / §5] Experimental validation (abstract and §5): the abstract asserts faster, more stable optimization and SOTA scores but supplies no quantitative results, error bars, ablation tables, or direct verification that the claimed SDE consistency is achieved or that the approximation converges to the data distribution at the operating point.
- [Method / §4] SDE-consistency claim (abstract and §4): the novel approximation is described as restoring consistency by freezing the clean-latent posterior mean, yet no equation, convergence proof, or numerical check is referenced showing that the resulting trajectory satisfies the target SDE at the small step counts used in RL.
minor comments (2)
- Notation for the derived SDE schedule should be introduced with an explicit equation number rather than described only in prose.
- The abstract would be clearer if it cited the specific section containing the toy-example figures and the RL-experiment tables.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments. We respond to each major comment below and indicate planned revisions to the manuscript.
-
Referee: [Toy example / §3] Toy-example diagnosis (abstract and §3): the claim that discretization deviation is the primary bottleneck is supported only by the toy case; no evidence is given that this error dominates over SDE-schedule choice, reward-model variance, or policy-gradient noise once step counts and resolutions reach those used in the reported RL experiments.
Authors: The toy example in §3 is designed to isolate and diagnose the discretization deviation mechanism under conditions where the exact SDE solution is known. We agree that it does not quantify the relative magnitude of this error against other sources in the full RL setting. The performance differences reported in §5 provide supporting evidence that addressing SDE consistency improves outcomes, but we will add an explicit discussion in the revised §3 addressing the potential interplay with reward-model variance and policy-gradient noise at the step counts used in our experiments. revision: yes
-
Referee: [Experiments / §5] Experimental validation (abstract and §5): the abstract asserts faster, more stable optimization and SOTA scores but supplies no quantitative results, error bars, ablation tables, or direct verification that the claimed SDE consistency is achieved or that the approximation converges to the data distribution at the operating point.
Authors: §5 contains the quantitative results (PickScore, HPSv2.1, and the reported 13.1-53.2% wall-clock reductions) along with comparisons to prior samplers. We will revise the manuscript to include error bars on the main metrics, add ablation tables isolating the contribution of the posterior-mean freezing step, and insert a numerical verification that the sampled trajectories remain consistent with the target SDE at the operating step counts. revision: yes
-
Referee: [Method / §4] SDE-consistency claim (abstract and §4): the novel approximation is described as restoring consistency by freezing the clean-latent posterior mean, yet no equation, convergence proof, or numerical check is referenced showing that the resulting trajectory satisfies the target SDE at the small step counts used in RL.
Authors: §4 presents the conceptual description of the approximation. In the revision we will add the explicit update equation and a numerical check measuring trajectory deviation from the target SDE at the step counts employed in the RL experiments. A formal convergence proof lies outside the scope of the present work, which builds on existing flow-matching theory; the added numerical check will serve as empirical support. revision: partial
Circularity Check
No significant circularity; derivation chain is self-contained
The paper's core steps consist of an analysis of exploration-stability tension to derive an SDE schedule, followed by a toy example demonstrating discretization deviations in prior samplers, and the introduction of a novel approximation (freezing the clean-latent posterior mean) to enforce SDE-consistency. None of these reduce by construction to fitted inputs, self-definitions, or self-citation chains; the claims rest on independent analytical reasoning and empirical results rather than tautological equivalences. No load-bearing equations or parameters are shown to be renamed predictions or imported uniqueness theorems from the authors' prior work.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The reverse-time ODE of flow matching can be replaced by an SDE to create a stochastic policy without changing the marginal data distribution at convergence.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
unclear: Relation between the paper passage and the cited Recognition theorem.
Theorem 4.3 (Exact transition under frozen posterior mean). ... $z_{t'} = (1-t')\,\hat{z}_0(t) + \frac{t'}{t}\, e^{-A(t',t)/2}\,\bigl(z_t - (1-t)\,\hat{z}_0(t)\bigr) + t'\sqrt{1 - e^{-A(t',t)}}\; w$
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction
unclear: Relation between the paper passage and the cited Recognition theorem.
Lemma 4.1 (First-order logSNR decomposition) ... $\Delta\lambda_{\mathrm{vel}} = \frac{2\,\Delta t}{t(1-t)} + o(\Delta t)$
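The leading term quoted from Lemma 4.1 can be sanity-checked numerically under one plausible reading. The sketch below assumes the interpolation z_t = (1 - t) z_0 + t ε, so that the log signal-to-noise ratio is λ(t) = 2 log((1 - t)/t), and it interprets the quoted quantity as the change in λ when moving from t to t - Δt. Both the convention and that interpretation are assumptions made for illustration; the page does not show the lemma's full statement. The function names are hypothetical.

```python
import math

def log_snr(t):
    # Assumed convention: z_t = (1 - t) * z_0 + t * eps, so SNR = ((1 - t) / t) ** 2.
    return 2.0 * math.log((1.0 - t) / t)

def first_order_term(t, dt):
    # Leading term as quoted from Lemma 4.1.
    return 2.0 * dt / (t * (1.0 - t))

def remainder(t, dt):
    # Exact logSNR change for a step from t down to t - dt, minus the leading term.
    exact = log_snr(t - dt) - log_snr(t)
    return exact - first_order_term(t, dt)

# The remainder divided by dt should shrink as dt shrinks if it is o(dt).
for dt in (1e-1, 1e-2, 1e-3):
    print(dt, remainder(0.3, dt) / dt)
```

Under these assumptions the derivative of λ is -2 / (t(1 - t)), which reproduces the quoted leading term, and the printed ratios fall roughly in proportion to Δt.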
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Training Diffusion Models with Reinforcement Learning
Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning. arXiv preprint arXiv:2305.13301.
-
[2]
HunyuanImage 3.0 Technical Report
Siyu Cao, Hangting Chen, Peng Chen, Yiji Cheng, Yutao Cui, Xinchi Deng, Ying Dong, Kipper Gong, Tianpeng Gu, Xiusen Gu, et al. Hunyuanimage 3.0 technical report. arXiv preprint arXiv:2509.23951.
-
[3]
Zheng Ding and Weirui Ye. Treegrpo: Tree-advantage grpo for online rl post-training of diffusion models. arXiv preprint arXiv:2512.08153.
-
[4]
Seedance 1.0: Exploring the Boundaries of Video Generation Models
Yu Gao, Haoyuan Guo, Tuyen Hoang, Weilin Huang, Lu Jiang, Fangyuan Kong, Huixia Li, Jiashi Li, Liang Li, Xiaojie Li, et al. Seedance 1.0: Exploring the boundaries of video generation models. arXiv preprint arXiv:2506.09113.
-
[5]
Clipscore: A reference-free evaluation metric for image captioning
Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. In Proceedings of the 2021 conference on empirical methods in natural language processing, pp. 7514–7528, 2021.
-
[6]
HunyuanVideo: A Systematic Framework For Large Video Generative Models
Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603.
-
[7]
MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE
Junzhe Li, Yutao Cui, Tao Huang, Yinping Ma, Chun Fan, Yiming Cheng, Miles Yang, Zhao Zhong, and Liefeng Bo. Mixgrpo: Unlocking flow-based grpo efficiency with mixed ode-sde. arXiv preprint arXiv:2507.21802, 2025a. Yuming Li, Yikai Wang, Yuying Zhu, Zhongyu Zhao, Ming Lu, Qi She, and Shanghang Zhang. Branchgrpo: Stable and efficient grpo with structured br...
-
[8]
Flow-GRPO: Training Flow Matching Models via Online RL
Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-grpo: Training flow matching models via online rl. arXiv preprint arXiv:2505.05470.
-
[9]
Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003.
-
[10]
Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference
Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. Latent consistency models: Synthesizing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378.
-
[11]
Progressive Distillation for Fast Sampling of Diffusion Models
Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512.
-
[12]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300,
-
[13]
Denoising Diffusion Implicit Models
Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020a. Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020b. Yang Song, Prafulla ...
-
[14]
Wan: Open and Advanced Large-Scale Video Generative Models
Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, T...
-
[15]
Feng Wang and Zihao Yu. Coefficients-preserving sampling for reinforcement learning with flow matching. arXiv preprint arXiv:2509.05952.
-
[16]
Jing Wang, Jiajun Liang, Jie Liu, Henglin Liu, Gongye Liu, Jun Zheng, Wanyuan Pang, Ao Ma, Zhenyu Xie, Xintao Wang, et al. Grpo-guard: Mitigating implicit over-optimization in flow matching via regulated clipping. arXiv preprint arXiv:2510.22319, 2025a. Yibin Wang, Yuhang Zang, Hao Li, Cheng Jin, and Jiaqi Wang. Unified reward model for multimodal understa...
-
[17]
Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341.
-
[18]
DanceGRPO: Unleashing GRPO on Visual Generation
Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, et al. Dancegrpo: Unleashing grpo on visual generation. arXiv preprint arXiv:2505.07818.
-
[19]
Fast Sampling of Diffusion Models with Exponential Integrator
Qinsheng Zhang and Yongxin Chen. Fast sampling of diffusion models with exponential integrator. arXiv preprint arXiv:2204.13902.
-
[20]
DiffusionNFT: Online Diffusion Reinforcement with Forward Process
Kaiwen Zheng, Huayu Chen, Haotian Ye, Haoxiang Wang, Qinsheng Zhang, Kai Jiang, Hang Su, Stefano Ermon, Jun Zhu, and Ming-Yu Liu. Diffusionnft: Online diffusion reinforcement with forward process. arXiv preprint arXiv:2509.16117.
-
[21]
and CPS (Wang & Yu, 2025), and η=1.5 for PRECISE, matching the main experimental protocol. As shown in Figure 1, CPS samples remain biased toward the inner ring even at N=80. Figure 8 isolates the large-N regime by tracking the CPS outer-ring mass as the NFE increases to N=1280. The target outer-ring mass is 0.5, but the curve does not approach that value ...
-
[22]
Stability AI Community License FLUX.2 Klein 4B Base Black Forest Labs FLUX.2 Klein 4B Base (Black Forest Labs, 2025
-
[23]
MIT CLIPScore / CLIP CLIPScore with OpenAI CLIP (clip-vit-large-patch14) (Hessel et al., 2021; Radford et al.,