Posterior-Augmented Flow Matching
Pith reviewed 2026-05-09 19:33 UTC · model grok-4.3
The pith
Posterior-Augmented Flow Matching gives an unbiased estimator of the flow objective with lower gradient variance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PAFM constructs an unbiased estimator of the standard flow matching objective by replacing single-target supervision with a mixture over multiple hypothesized endpoints. The mixture weights come from an importance-sampled approximation to the posterior p(endpoint | intermediate, condition), obtained by multiplying the likelihood of the observed intermediate under each candidate endpoint by the prior probability of that endpoint under the condition.
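A minimal sketch of how such weights could be computed, assuming the common linear interpolation path x_t = (1 − t) x_0 + t x_1 with x_0 ~ N(0, I), under which p(x_t | x_1) = N(x_t; t·x_1, (1 − t)²I); the function name, the candidate set, and the softmax normalization over log weights are illustrative choices, not the paper's released implementation.

```python
import torch

def pafm_posterior_weights(x_t, t, candidates, log_prior):
    """Hypothetical sketch of the factorized posterior weights.

    Assumes a linear path x_t = (1 - t) * x0 + t * x1 with x0 ~ N(0, I),
    which gives the likelihood p(x_t | x1) = N(x_t; t * x1, (1 - t)^2 I).

    x_t:        (d,) intermediate state
    t:          float in [0, 1)
    candidates: (K, d) hypothesized endpoints x1_k
    log_prior:  (K,) log p(x1_k | condition)
    returns:    (K,) self-normalized weights, posterior ∝ likelihood * prior
    """
    var = (1.0 - t) ** 2
    # log N(x_t; t * x1_k, var * I) up to a constant shared by all candidates
    log_lik = -0.5 * ((x_t.unsqueeze(0) - t * candidates) ** 2).sum(dim=1) / var
    # Normalizing in log space keeps the weights finite and positive
    return torch.softmax(log_lik + log_prior, dim=0)

# The posterior-augmented loss would then average the per-candidate FM
# regression losses under these weights instead of using one paired endpoint:
#   v_k  = (candidates - x_t) / (1 - t)            # per-candidate velocity
#   loss = (w * ((v_pred - v_k) ** 2).sum(-1)).sum()
```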
What carries the argument
The importance sampling estimator for the posterior-augmented loss that mixes several candidate targets per training example.
If this is right
- Gradient variance drops because each intermediate receives gradient contributions from many plausible continuations rather than one.
- The learned dynamics avoid flow collapse and map diverse inputs to varied outputs.
- Generation quality improves by up to 3.4 FID points on ImageNet and CC12M for both SiT and MMDiT models.
- The extra computation stays negligible because importance sampling reuses the same model evaluations.
- The method works for both class-conditioned and text-conditioned generation.
Where Pith is reading between the lines
- This variance reduction could let practitioners train with smaller batch sizes or fewer epochs while reaching the same performance.
- Analogous posterior augmentation might stabilize training in other continuous-time generative models such as diffusion or score-based approaches.
- Validating the unbiasedness in a low-dimensional toy setting where the true posterior is computable would confirm that the approximation introduces no bias.
- The choice of prior and likelihood models for the factorization directly controls how effective the variance reduction is.
Load-bearing premise
The proposed factorization of the posterior combined with importance sampling approximates the true distribution over valid target completions without adding bias or too much extra variance.
What would settle it
Compare the PAFM loss to the standard FM loss on the same batch and check whether their expectations match; also measure per-example gradient norms to verify the claimed variance reduction, and track final FID to see whether the reported gains appear. A toy version of this protocol is sketched below.
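A minimal sketch of this protocol in a 2-D toy setting, assuming a two-component Gaussian mixture as data, a uniform prior over fresh data draws as the candidate proposal, and the Gaussian likelihood induced by the linear path. The self-normalized weights are only asymptotically exact in the number of candidates K and degrade as t approaches 1, so this illustrates the check rather than reproducing the paper's estimator.

```python
import torch

torch.manual_seed(0)
d, K, n, trials = 2, 16, 64, 400

model = torch.nn.Sequential(torch.nn.Linear(d + 1, 64),
                            torch.nn.Tanh(), torch.nn.Linear(64, d))
params = list(model.parameters())

def sample_x1(m):
    # Toy target distribution: two-component Gaussian mixture in 2-D
    comp = torch.randint(0, 2, (m, 1)).float() * 4.0 - 2.0
    return comp + 0.1 * torch.randn(m, d)

def flat_grad(loss):
    g = torch.autograd.grad(loss, params, retain_graph=True)
    return torch.cat([x.reshape(-1) for x in g])

fm_loss, pafm_loss, fm_g, pafm_g = [], [], [], []
for _ in range(trials):
    x1 = sample_x1(n)
    x0 = torch.randn(n, d)
    t = torch.rand(n, 1) * 0.95                    # stay away from t = 1
    xt = (1 - t) * x0 + t * x1
    v_pred = model(torch.cat([xt, t], dim=1))

    # (a) standard FM: one paired endpoint supervises each intermediate
    la = ((v_pred - (x1 - x0)) ** 2).sum(-1).mean()
    fm_loss.append(la.item()); fm_g.append(flat_grad(la))

    # (b) PAFM-style: weight per-candidate losses by the approximate
    #     posterior over K fresh endpoints (uniform prior over data draws)
    cand = sample_x1(n * K).view(n, K, d)
    log_lik = (-0.5 * ((xt.unsqueeze(1) - t.unsqueeze(1) * cand) ** 2).sum(-1)
               / (1 - t) ** 2)
    w = torch.softmax(log_lik, dim=1)              # (n, K) posterior weights
    v_k = (cand - xt.unsqueeze(1)) / (1 - t).unsqueeze(1)
    lb = (w * ((v_pred.unsqueeze(1) - v_k) ** 2).sum(-1)).sum(1).mean()
    pafm_loss.append(lb.item()); pafm_g.append(flat_grad(lb))

fm_g, pafm_g = torch.stack(fm_g), torch.stack(pafm_g)
print("mean loss FM / PAFM:", sum(fm_loss) / trials, sum(pafm_loss) / trials)
print("grad var  FM / PAFM:", fm_g.var(0).sum().item(), pafm_g.var(0).sum().item())
print("mean-gradient gap  :", (fm_g.mean(0) - pafm_g.mean(0)).norm().item())
```

If the unbiasedness claim holds, the two mean losses and mean gradients should agree up to Monte Carlo error, while the PAFM gradient variance should be visibly smaller.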
read the original abstract
Flow matching (FM) trains a time-dependent vector field that transports samples from a simple prior to a complex data distribution. However, for high-dimensional images, each training sample supervises only a single trajectory and intermediate point, yielding an extremely sparse and high-variance training signal. This under-constrained supervision can cause flow collapse, where the learned dynamics memorize specific source-target pairings, mapping diverse inputs to overly similar outputs, failing to generalize. We introduce Posterior-Augmented Flow Matching (PAFM), a theoretically grounded generalization of FM that replaces single-target supervision with an expectation over an approximate posterior of valid target completions for a given intermediate state and condition. PAFM factorizes this intractable posterior into (i) the likelihood of the intermediate under a hypothesized endpoint and (ii) the prior probability of that endpoint under the condition, and uses an importance sampling scheme to construct a mixture over multiple candidate targets. We prove that PAFM yields an unbiased estimator of the original FM objective while substantially reducing gradient variance during training by aggregating information from many plausible continuation trajectories per intermediate. Finally, we show that PAFM improves over FM by up to 3.4 FID50K across different model scales (SiT-B/2 and SiT-XL/2), different architectures (SiT and MMDiT), and in both class and text conditioned benchmarks (ImageNet and CC12M), with a negligible increase in the compute overhead. Code: https://github.com/gstoica27/PAFM.git.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Posterior-Augmented Flow Matching (PAFM), a generalization of flow matching (FM) for training time-dependent vector fields. Standard FM supervises each intermediate state with only a single target trajectory, leading to high-variance gradients and flow collapse in high-dimensional image domains. PAFM replaces this with an expectation over multiple plausible target completions drawn from an approximate posterior, factorized as the product of a likelihood term (intermediate under hypothesized endpoint) and a prior term (endpoint under conditioning). An importance-sampling scheme constructs a mixture over candidate targets. The authors prove that PAFM is an unbiased estimator of the original FM objective and claim it reduces gradient variance by aggregating information across trajectories. Experiments report FID improvements of up to 3.4 on ImageNet and CC12M using SiT and MMDiT backbones at multiple scales, with negligible compute overhead.
Significance. If the unbiasedness proof holds and the importance-sampling scheme delivers the claimed variance reduction without excessive overhead, PAFM would provide a principled, theoretically grounded improvement to flow-based generative modeling that directly addresses sparse supervision in high-dimensional settings. The empirical gains across architectures, conditioning types, and model scales, together with the public code release, strengthen the contribution. The approach is distinguished by a formal unbiasedness proof rather than heuristic modification.
major comments (2)
- [§3.2, Eq. (7)–(9)] The unbiasedness argument relies on the importance weights exactly recovering the conditional vector-field regression target when the proposal equals the prior; the manuscript should explicitly state the support conditions under which the likelihood term remains well-defined and finite for continuous high-dimensional image data, as any truncation or approximation in the likelihood could introduce bias not captured by the current derivation.
- [Experiments section, Table 1 and Figure 4] The claim of “substantially reducing gradient variance” is central to the motivation yet is supported only indirectly via FID gains; direct measurements (e.g., gradient-norm histograms or variance of the loss estimator across training steps) are absent, leaving open the possibility that observed improvements arise from other factors such as effective batch size or regularization.
minor comments (3)
- [§2.1] The notation for the conditional path density p(x_t | x_0, x_1) is introduced without an explicit reminder that it is the same density appearing in the original FM objective; a one-sentence cross-reference would improve readability.
- [Figure 3] The caption does not specify the number of importance samples used per intermediate state or the temperature of the proposal; these hyperparameters are load-bearing for reproducibility and should be stated.
- [Appendix A] The implementation details for sampling from the factorized posterior (e.g., how the likelihood is evaluated for image patches) are only sketched; expanding this subsection with pseudocode would aid readers attempting to reproduce the variance-reduction effect.
Simulated Author's Rebuttal
We thank the referee for the positive evaluation and recommendation for minor revision. The comments highlight important points for strengthening the theoretical and empirical presentation, which we address below.
read point-by-point responses
-
Referee: [§3.2, Eq. (7)–(9)] The unbiasedness argument relies on the importance weights exactly recovering the conditional vector-field regression target when the proposal equals the prior; the manuscript should explicitly state the support conditions under which the likelihood term remains well-defined and finite for continuous high-dimensional image data, as any truncation or approximation in the likelihood could introduce bias not captured by the current derivation.
Authors: We agree that the support conditions merit explicit clarification to ensure the derivation is fully rigorous. In the PAFM formulation, the likelihood term is defined as a Gaussian density p(x_t | x_1) = N(x_t; x_1, σ²I) with fixed σ > 0, which is positive and finite everywhere on R^d. Image data are normalized to a bounded interval (e.g., [−1, 1]^d), but the Gaussian support remains all of R^d, so no truncation occurs and the importance weights are always well-defined. The unbiasedness proof therefore holds without additional bias. We will insert a short paragraph after Eq. (9) stating these support conditions and confirming that the Gaussian model introduces no truncation. revision: yes
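A small numerical illustration of the rebuttal's point (an assumed setup, not taken from the paper's code): in high dimension the raw Gaussian density can underflow to zero in floating point even though it is mathematically positive everywhere, so evaluating the weights with a log-space softmax keeps them finite and normalized. The dimensions and σ below are arbitrary.

```python
import torch

# Assumed setting: dimension high enough that raw Gaussian densities
# underflow in float32, although N(x_t; x_1, sigma^2 I) > 0 on all of R^d.
d, K, sigma = 4096, 8, 0.1
x_t = torch.rand(d) * 2 - 1          # data normalized to [-1, 1]^d
cand = torch.rand(K, d) * 2 - 1      # K candidate endpoints in [-1, 1]^d

# log N(x_t; x1_k, sigma^2 I) up to a constant shared by all candidates
log_lik = -0.5 * ((x_t - cand) ** 2).sum(dim=1) / sigma ** 2

raw = torch.exp(log_lik)             # underflows to exact zeros
w = torch.softmax(log_lik, dim=0)    # finite, positive, sums to 1

print(raw.sum())                     # 0.0 -> naive normalization is 0/0
print(w.sum())                       # 1.0
```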
-
Referee: [Experiments section, Table 1 and Figure 4] The claim of “substantially reducing gradient variance” is central to the motivation yet is supported only indirectly via FID gains; direct measurements (e.g., gradient-norm histograms or variance of the loss estimator across training steps) are absent, leaving open the possibility that observed improvements arise from other factors such as effective batch size or regularization.
Authors: We acknowledge that direct empirical verification of variance reduction would make the central motivation more compelling. While the theoretical analysis establishes that the importance-sampling estimator is unbiased and aggregates information across multiple trajectories, we will add new experiments in the revised manuscript that directly measure the variance of the gradient estimator and the loss value over training steps for both PAFM and standard FM (using identical random seeds and batch sizes). These results will be reported in an additional figure or table in the Experiments section, allowing readers to assess the variance reduction independently of the FID improvements. revision: yes
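A sketch of instrumentation in the spirit of what the authors promise (the helper below is illustrative, not from the paper's repository): Welford's online algorithm is one standard way to track the variance of per-step gradient norms without storing the full history.

```python
import torch

class RunningVariance:
    """Welford's online mean/variance for a stream of scalar statistics."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x: float):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def var(self) -> float:
        return self.m2 / (self.n - 1) if self.n > 1 else float("nan")

def grad_norm(model: torch.nn.Module) -> float:
    # Global L2 norm of the current gradients; call after loss.backward()
    sq = sum((p.grad ** 2).sum() for p in model.parameters()
             if p.grad is not None)
    return float(torch.sqrt(sq))

# Usage: keep one tracker per objective under matched seeds and batch sizes,
# then compare tracker_fm.var against tracker_pafm.var over training:
#   tracker = RunningVariance()
#   ... loss.backward(); tracker.update(grad_norm(model)); optimizer.step()
```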
Circularity Check
No significant circularity detected
full rationale
The paper's central derivation claims that PAFM is an unbiased estimator of the standard flow matching objective via importance sampling over a factorized posterior approximation. This follows directly from the standard properties of importance sampling (the weighted expectation recovers the original target when the proposal matches the prior), without reducing to a self-definition, a fitted parameter renamed as a prediction, or a load-bearing self-citation. The proof is presented as a first-principles result on the estimator, and empirical gains are reported on external benchmarks (ImageNet, CC12M) rather than self-referential fits. No step in the provided derivation chain collapses by construction to its own inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The posterior over valid target completions factorizes into the likelihood of the intermediate state under a hypothesized endpoint times the prior probability of that endpoint under the conditioning variable.