pith. machine review for the scientific record. sign in

arxiv: 2605.00825 · v1 · submitted 2026-05-01 · 💻 cs.CV

Recognition: unknown

Posterior Augmented Flow Matching

Abhay Nori, Ali Farhadi, George Stoica, Judy Hoffman, Matthew Wallingford, Ranjay Krishna, Sayak Paul, Vivek Ramanujan, Winson Han

Authors on Pith no claims yet

Pith reviewed 2026-05-09 19:33 UTC · model grok-4.3

classification 💻 cs.CV
keywords flow matchingposterior augmentationgenerative modelingimage synthesisvariance reductionunbiased estimationSiTMMDiT
0
0 comments X

The pith

Posterior-Augmented Flow Matching gives an unbiased estimator of the flow objective with lower gradient variance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Flow matching learns to transport points from noise to data by training a vector field on single source-target pairs, but in high dimensions this leaves each intermediate state with only one supervising trajectory. The paper replaces that single target with an expectation over an approximate posterior of possible targets that could have led to the same intermediate point. By factorizing the posterior and drawing multiple candidates via importance sampling, the resulting loss stays equal to the original flow matching objective in expectation while the gradient signal becomes much less noisy. If the reduction in variance holds, training should converge to better models that generalize across many possible paths instead of collapsing to memorized pairings.

Core claim

PAFM constructs an unbiased estimator of the standard flow matching objective by replacing the single-target supervision with a mixture over multiple hypothesized endpoints, where the mixture weights come from an importance-sampled approximation to the posterior p(endpoint | intermediate, condition) obtained by multiplying the likelihood of the observed intermediate under each candidate endpoint with the prior probability of that endpoint.

What carries the argument

The importance sampling estimator for the posterior-augmented loss that mixes several candidate targets per training example.

If this is right

  • Gradient variance drops because each intermediate receives gradient contributions from many plausible continuations rather than one.
  • The learned dynamics avoid flow collapse and map diverse inputs to varied outputs.
  • Generation quality improves by up to 3.4 FID points on ImageNet and CC12M for both SiT and MMDiT models.
  • The extra computation stays negligible because importance sampling reuses the same model evaluations.
  • The method works for both class-conditioned and text-conditioned generation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This variance reduction could let practitioners train with smaller batch sizes or fewer epochs while reaching the same performance.
  • Analogous posterior augmentation might stabilize training in other continuous-time generative models such as diffusion or score-based approaches.
  • Validating the unbiasedness in a low-dimensional toy setting where the true posterior is computable would confirm that the approximation introduces no bias.
  • The choice of prior and likelihood models for the factorization directly controls how effective the variance reduction is.

Load-bearing premise

The proposed factorization of the posterior combined with importance sampling approximates the true distribution over valid target completions without adding bias or too much extra variance.

What would settle it

Compare the PAFM loss value to the standard FM loss on the same batch and check whether their expectations match; also measure per-example gradient norms to verify variance reduction and track final FID to see if the claimed gains appear.

Figures

Figures reproduced from arXiv: 2605.00825 by Abhay Nori, Ali Farhadi, George Stoica, Judy Hoffman, Matthew Wallingford, Ranjay Krishna, Sayak Paul, Vivek Ramanujan, Winson Han.

Figure 1
Figure 1. Figure 1 view at source ↗
Figure 2
Figure 2. Figure 2: Posterior augmented flow matching is more robust than flow matching. We train two rectified flow models to generate two crescent moon distributions (left). Second-left: a model trained with FM generates many points in between the two moons. Second-right: the same model trained with PAFM has far less of an issue. Right: PAFM estimates the true velocity field across t significantly better. The PAFM training … view at source ↗
Figure 3
Figure 3. Figure 3: PAFM reduces mini-batch gradi￾ent variance over FM. Optimization steps over the same 500 iterations for REPA-SiT-B/2 [19] models trained with nearest neighbor PAFM with K=16 and FM on ImageNet-1K [4]. Full lines show gradient variance at each iteration, while the correspondingly colored dashed lines indicate the mean variance across all iterations. PAFM reduces gradient variance by ∼ 4×. PAFM marginally de… view at source ↗
read the original abstract

Flow matching (FM) trains a time-dependent vector field that transports samples from a simple prior to a complex data distribution. However, for high-dimensional images, each training sample supervises only a single trajectory and intermediate point, yielding an extremely sparse and high-variance training signal. This under-constrained supervision can cause flow collapse, where the learned dynamics memorize specific source-target pairings, mapping diverse inputs to overly similar outputs, failing to generalize. We introduce Posterior-Augmented Flow Matching (PAFM), a theoretically grounded generalization of FM that replaces single-target supervision with an expectation over an approximate posterior of valid target completions for a given intermediate state and condition. PAFM factorizes this intractable posterior into (i) the likelihood of the intermediate under a hypothesized endpoint and (ii) the prior probability of that endpoint under the condition, and uses an importance sampling scheme to construct a mixture over multiple candidate targets. We prove that PAFM yields an unbiased estimator of the original FM objective while substantially reducing gradient variance during training by aggregating information from many plausible continuation trajectories per intermediate. Finally, we show that PAFM improves over FM by up to 3.4 FID50K across different model scales (SiT-B/2 and SiT-XL/2), different architectures (SiT and MMDiT), and in both class and text conditioned benchmarks (ImageNet and CC12M), with a negligible increase in the compute overhead. Code: https://github.com/gstoica27/PAFM.git.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces Posterior-Augmented Flow Matching (PAFM), a generalization of flow matching (FM) for training time-dependent vector fields. Standard FM supervises each intermediate state with only a single target trajectory, leading to high-variance gradients and flow collapse in high-dimensional image domains. PAFM replaces this with an expectation over multiple plausible target completions drawn from an approximate posterior, factorized as the product of a likelihood term (intermediate under hypothesized endpoint) and a prior term (endpoint under conditioning). An importance-sampling scheme constructs a mixture over candidate targets. The authors prove that PAFM is an unbiased estimator of the original FM objective and claim it reduces gradient variance by aggregating information across trajectories. Experiments report FID improvements of up to 3.4 on ImageNet and CC12M using SiT and MMDiT backbones at multiple scales, with negligible compute overhead.

Significance. If the unbiasedness proof holds and the importance-sampling scheme delivers the claimed variance reduction without excessive overhead, PAFM would provide a principled, theoretically grounded improvement to flow-based generative modeling that directly addresses sparse supervision in high-dimensional settings. The empirical gains across architectures, conditioning types, and model scales, together with the public code release, strengthen the contribution. The approach is distinguished by its external proof claim rather than heuristic modifications.

major comments (2)
  1. [§3.2, Eq. (7)–(9)] §3.2, Eq. (7)–(9): The unbiasedness argument relies on the importance weights exactly recovering the conditional vector-field regression target when the proposal equals the prior; the manuscript should explicitly state the support conditions under which the likelihood term remains well-defined and finite for continuous high-dimensional image data, as any truncation or approximation in the likelihood could introduce bias not captured by the current derivation.
  2. [Experiments section, Table 1 and Figure 4] Experiments section, Table 1 and Figure 4: The claim of “substantially reducing gradient variance” is central to the motivation yet is supported only indirectly via FID gains; direct measurements (e.g., gradient-norm histograms or variance of the loss estimator across training steps) are absent, leaving open the possibility that observed improvements arise from other factors such as effective batch size or regularization.
minor comments (3)
  1. [§2.1] §2.1: The notation for the conditional path density p(x_t | x_0, x_1) is introduced without an explicit reminder that it is the same density appearing in the original FM objective; a one-sentence cross-reference would improve readability.
  2. [Figure 3] Figure 3: The caption does not specify the number of importance samples used per intermediate state or the temperature of the proposal; these hyperparameters are load-bearing for reproducibility and should be stated.
  3. [Appendix A] Appendix A: The implementation details for sampling from the factorized posterior (e.g., how the likelihood is evaluated for image patches) are only sketched; expanding this subsection with pseudocode would aid readers attempting to reproduce the variance-reduction effect.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation and recommendation for minor revision. The comments highlight important points for strengthening the theoretical and empirical presentation, which we address below.

read point-by-point responses
  1. Referee: [§3.2, Eq. (7)–(9)] The unbiasedness argument relies on the importance weights exactly recovering the conditional vector-field regression target when the proposal equals the prior; the manuscript should explicitly state the support conditions under which the likelihood term remains well-defined and finite for continuous high-dimensional image data, as any truncation or approximation in the likelihood could introduce bias not captured by the current derivation.

    Authors: We agree that the support conditions merit explicit clarification to ensure the derivation is fully rigorous. In the PAFM formulation, the likelihood term is defined as a Gaussian density p(x_t | x_1) = N(x_t; x_1, σ²I) with fixed σ > 0, which is positive and finite everywhere on R^d. Image data are normalized to a bounded interval (e.g., [−1, 1]^d), but the Gaussian support remains all of R^d, so no truncation occurs and the importance weights are always well-defined. The unbiasedness proof therefore holds without additional bias. We will insert a short paragraph after Eq. (9) stating these support conditions and confirming that the Gaussian model introduces no truncation. revision: yes

  2. Referee: [Experiments section, Table 1 and Figure 4] The claim of “substantially reducing gradient variance” is central to the motivation yet is supported only indirectly via FID gains; direct measurements (e.g., gradient-norm histograms or variance of the loss estimator across training steps) are absent, leaving open the possibility that observed improvements arise from other factors such as effective batch size or regularization.

    Authors: We acknowledge that direct empirical verification of variance reduction would make the central motivation more compelling. While the theoretical analysis establishes that the importance-sampling estimator is unbiased and aggregates information across multiple trajectories, we will add new experiments in the revised manuscript that directly measure the variance of the gradient estimator and the loss value over training steps for both PAFM and standard FM (using identical random seeds and batch sizes). These results will be reported in an additional figure or table in the Experiments section, allowing readers to assess the variance reduction independently of the FID improvements. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper's central derivation claims that PAFM is an unbiased estimator of the standard flow matching objective via importance sampling over a factorized posterior approximation. This follows directly from the standard properties of importance sampling (the weighted expectation recovers the original target when the proposal matches the prior), without reducing to a self-definition, a fitted parameter renamed as a prediction, or a load-bearing self-citation. The proof is presented as a first-principles result on the estimator, and empirical gains are reported on external benchmarks (ImageNet, CC12M) rather than self-referential fits. No step in the provided derivation chain collapses by construction to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on the validity of the posterior factorization and the unbiasedness of the importance sampling estimator; these are domain assumptions for the method rather than derived quantities.

axioms (1)
  • domain assumption The posterior over valid target completions factorizes into the likelihood of the intermediate state under a hypothesized endpoint times the prior probability of that endpoint under the conditioning variable.
    Explicitly stated in the abstract as the basis for constructing the mixture via importance sampling.

pith-pipeline@v0.9.0 · 5593 in / 1242 out tokens · 32453 ms · 2026-05-09T19:33:11.437979+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

35 extracted references · 5 canonical work pages · 3 internal anchors

  1. [1]

    On the surprising behavior of distance metrics in high dimensional space

    Charu C Aggarwal, Alexander Hinneburg, and Daniel A Keim. On the surprising behavior of distance metrics in high dimensional space. In International conference on database theory. 10 Springer, 2001. 2

  2. [2]

    floq: Training critics via flow-matching for scaling compute in value-based rl

    Bhavya Agrawalla, Michal Nauman, Khush Agrawal, and Aviral Kumar. floq: Training critics via flow-matching for scaling compute in value-based rl. arXiv, 2025. 4

  3. [3]

    Stochastic interpolants: A unifying framework for flows and diffusions

    Michael S Albergo, Nicholas M Boffi, and Eric Vanden-Eijnden. Stochastic interpolants: A unifying framework for flows and diffusions. arXiv, 2023. 3

  4. [4]

    Imagenet: A large- scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large- scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009. 2, 8, 10

  5. [5]

    The faiss library

    Matthijs Douze, Alexandr Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvasy, Pierre- Emmanuel Mazaré, Maria Lomeli, Lucas Hosseini, and Hervé Jégou. The faiss library. arXiv,

  6. [6]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis. ICML, 2024. 2

  7. [7]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, et al. Scaling rectified flow transformers for high-resolution image synthesis. In International Conference on Machine Learning, 2024. 3

  8. [8]

    Scaling rectified flow transformers for high-resolution image synthesis, 2024

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis, 2024. 3

  9. [9]

    Flow matching achieves almost minimax optimal convergence

    Kenji Fukumizu, Taiji Suzuki, Noboru Isobe, Kazusato Oko, and Masanori Koyama. Flow matching achieves almost minimax optimal convergence. arXiv, 2024. 2, 3

  10. [10]

    On the relation between rectified flows and optimal transport

    Johannes Hertrich, Antonin Chambolle, and Julie Delon. On the relation between rectified flows and optimal transport. arXiv, 2025. 3

  11. [11]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017. 8

  12. [12]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance, 2022. URL https:// arxiv.org/abs/2207.12598. 8

  13. [13]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, 2020. 3

  14. [14]

    Feldman, Hong Zhang, and Bao Wang

    Yuhao Huang, Taos Transue, Shih-Hsin Wang, William M. Feldman, Hong Zhang, and Bao Wang. Improving flow matching by aligning flow divergence. In International Conference on Machine Learning, 2025. 3

  15. [15]

    Improving flow matching by aligning flow divergence

    Yuhao Huang, Taos Transue, Shih-Hsin Wang, William M Feldman, Hong Zhang, and Bao Wang. Improving flow matching by aligning flow divergence. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id= FeZimuj6SG. 2

  16. [16]

    Minimizing trajectory curvature of ODE-based generative models

    Sangyun Lee, Beomsu Kim, and Jong Chul Ye. Minimizing trajectory curvature of ODE-based generative models. In International Conference on Machine Learning, 2023. 3

  17. [17]

    Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In ICLR, 2023. URL https://openreview.net/forum? id=PqvMRDCJT9t. 1, 2, 3

  18. [18]

    Flow straight and fast: Learning to generate and transfer data with rectified flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. In Advances in Neural Information Processing Systems,

  19. [19]

    Albergo, Nicholas M

    Nanye Ma, Mark Goldstein, Michael S. Albergo, Nicholas M. Boffi, Eric Vanden-Eijnden, and Saining Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part LXXVII, 2024. URL https: //doi.org/1...

  20. [20]

    Qi Mao, Hao Cheng, Tinghan Yang, Libiao Jin, and Siwei Ma

    Nanye Ma, Mark Goldstein, Michael S. Albergo, Nicholas M. Boffi, Eric Vanden-Eijnden, and Saining Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers, 2024. URLhttps://arxiv.org/abs/2401.08740. 1

  21. [21]

    Generating images with sparse representations

    Charlie Nash, Jacob Menick, Sander Dieleman, and Peter W Battaglia. Generating images with sparse representations. arXiv, 2021. 8

  22. [22]

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V . V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel HAZIZA, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Jegou, Julien Mairal, Patrick L...

  23. [23]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023. 2, 8

  24. [24]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021. 6, 9

  25. [25]

    Gradient variance reveals failure modes in flow-based generative models

    Teodora Reu, Sixtine Dromigny, Michael Bronstein, and Francisco Vargas. Gradient variance reveals failure modes in flow-based generative models. arXiv, 2025. 2, 3

  26. [26]

    High- resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High- resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 2, 9

  27. [27]

    Improved techniques for training gans

    Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. Advances in neural information processing systems,

  28. [28]

    Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning

    Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of ACL, 2018. 2, 8

  29. [29]

    Denoising diffusion implicit models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv,

  30. [30]

    Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole

    Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021. 3

  31. [31]

    Contrastive flow matching

    George Stoica, Vivek Ramanujan, Xiang Fan, Ali Farhadi, Ranjay Krishna, and Judy Hoffman. Contrastive flow matching. ICCV, 2025. 1

  32. [32]

    Im- proving and generalizing flow-based generative models with minibatch optimal transport

    Alexander Tong, Kilian Fatras, Nikolay Malkin, Guillaume Huguet, Yanlei Zhang, et al. Im- proving and generalizing flow-based generative models with minibatch optimal transport. Transactions on Machine Learning Research, 2024. 3

  33. [33]

    SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alab- dulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner, and Xiaohua Zhai. Siglip 2: Multilingual vision- language encoders with improved semantic understanding, localization, and dense features....

  34. [34]

    H. Wiegand. Kish, l.: Survey sampling. john wiley & sons, inc., new york, london 1965, ix + 643 s., 31 abb., 56 tab., preis 83 s. Biometrische Zeitschrift, 1968. 5

  35. [35]

    Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think

    Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think, 2024. URLhttps://arxiv.org/abs/2410.06940. 1, 4, 8 13 Appendix A Crescent Example Implementation Details Setting.We construct a two-dimensional crescent moo...