Recognition: 2 theorem links · Lean Theorem
Mean Flows for One-step Generative Modeling
Pith reviewed 2026-05-11 14:24 UTC · model grok-4.3
The pith
MeanFlow derives an identity between average and instantaneous velocities to enable high-quality one-step generative modeling from scratch.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce the notion of average velocity to characterize flow fields, in contrast to instantaneous velocity modeled by Flow Matching methods. A well-defined identity between average and instantaneous velocities is derived and used to guide neural network training. Our method, termed the MeanFlow model, is self-contained and requires no pre-training, distillation, or curriculum learning. MeanFlow demonstrates strong empirical performance: it achieves an FID of 3.43 with a single function evaluation (1-NFE) on ImageNet 256x256 trained from scratch, significantly outperforming previous state-of-the-art one-step diffusion/flow models.
What carries the argument
The derived identity relating average velocity (the interval-normalized integral of instantaneous velocity over a time span) to instantaneous velocity at a chosen point, which is used to supervise neural network training for direct one-step sampling.
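A hedged reconstruction of that machinery, taking "average velocity" to be the interval-normalized integral of instantaneous velocity along the flow trajectory (notation here is illustrative; the paper's exact symbols may differ):

```latex
% Average velocity over [r, t], for instantaneous velocity v along dz/dt = v:
u(z_t, r, t) \;\triangleq\; \frac{1}{t - r} \int_{r}^{t} v(z_\tau, \tau)\, d\tau

% Differentiating (t - r)\, u(z_t, r, t) = z_t - z_r with respect to t
% (holding r fixed) gives an identity between the two velocities:
v(z_t, t) \;=\; u(z_t, r, t) + (t - r)\, \frac{d}{dt}\, u(z_t, r, t),
\qquad \frac{d}{dt} u = v \cdot \partial_z u + \partial_t u
```

Rearranged for u, this yields a regression target for a network predicting average velocity, with the total derivative computable in a single Jacobian-vector product rather than by evaluating the integral.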
If this is right
- One-step models become competitive with multi-step diffusion and flow models on large image datasets without extra training stages.
- Generative training simplifies because the network learns directly from the velocity identity rather than through distillation or curriculum.
- The performance gap between single-evaluation and iterative sampling narrows substantially on tasks like ImageNet 256x256 generation.
- The same identity-based training can be applied to other flow-based generative settings to improve efficiency.
Where Pith is reading between the lines
- The average-velocity idea could be adapted to conditional generation or other modalities by changing how the interval for averaging is chosen.
- Variations in the exact form of the identity might yield further gains in sample quality or training stability.
- Similar averaging principles might be tested in non-flow generative models to see if they reduce the need for many sampling steps.
Load-bearing premise
A neural network can accurately learn to produce samples by following the average-to-instantaneous velocity identity without any implicit reliance on multi-step pre-training or adjustments.
What would settle it
Train a MeanFlow model from scratch on a held-out dataset using only the velocity identity loss and measure whether one-step samples reach FID scores close to multi-step baselines without any extra techniques.
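As a concrete illustration of such a test, a minimal JAX sketch of an identity-only training loss, assuming the linear interpolation z_t = (1 - t)x + t·ε with conditional velocity v = ε - x as in standard flow matching; u_fn, meanflow_loss, and the sampling of (r, t) are illustrative, not the paper's exact recipe:

```python
import jax
import jax.numpy as jnp

def meanflow_loss(params, u_fn, x, eps, r, t):
    """Identity-only loss sketch. u_fn(params, z, r, t) predicts average velocity.

    Assumes z_t = (1 - t) * x + t * eps, conditional velocity v = eps - x,
    and the identity u = v - (t - r) * d/dt u (shapes assumed broadcastable).
    """
    z = (1.0 - t) * x + t * eps        # point on the interpolation path
    v = eps - x                        # conditional instantaneous velocity

    # Total derivative d/dt u = v . dz u + dt u via one JVP,
    # with tangents (dz/dt, dr/dt, dt/dt) = (v, 0, 1).
    u, dudt = jax.jvp(
        lambda z_, r_, t_: u_fn(params, z_, r_, t_),
        (z, r, t),
        (v, jnp.zeros_like(r), jnp.ones_like(t)),
    )

    # Regress u onto the identity-implied target; stop-gradient keeps the
    # target fixed so the network is not chasing its own derivative.
    target = jax.lax.stop_gradient(v - (t - r) * dudt)
    return jnp.mean((u - target) ** 2)
```

Under these conventions one-step sampling is a single evaluation: with z_1 = ε pure noise, the sample is x̂ = z_1 - u(z_1, 0, 1), since (t - r)·u equals the displacement z_t - z_r.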
read the original abstract
We propose a principled and effective framework for one-step generative modeling. We introduce the notion of average velocity to characterize flow fields, in contrast to instantaneous velocity modeled by Flow Matching methods. A well-defined identity between average and instantaneous velocities is derived and used to guide neural network training. Our method, termed the MeanFlow model, is self-contained and requires no pre-training, distillation, or curriculum learning. MeanFlow demonstrates strong empirical performance: it achieves an FID of 3.43 with a single function evaluation (1-NFE) on ImageNet 256x256 trained from scratch, significantly outperforming previous state-of-the-art one-step diffusion/flow models. Our study substantially narrows the gap between one-step diffusion/flow models and their multi-step predecessors, and we hope it will motivate future research to revisit the foundations of these powerful models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes MeanFlow, a framework for one-step generative modeling that defines average velocity (in contrast to instantaneous velocity in Flow Matching), derives an identity between the two, and uses the identity to directly train a neural network for single-step sampling. The method is claimed to be self-contained with no pre-training, distillation, or curriculum learning required. It reports an FID of 3.43 on ImageNet 256x256 with 1-NFE, outperforming prior one-step diffusion/flow models and narrowing the gap to multi-step predecessors.
Significance. If the empirical result and the learnability of the identity hold without hidden multi-step effects, the work would be significant for simplifying generative modeling pipelines by enabling high-quality one-step sampling from scratch. This could reduce inference cost and influence future designs of flow-based models. The self-contained training protocol, if verified, is a clear strength.
major comments (2)
- [Abstract] The central empirical claim of FID = 3.43 (1-NFE, ImageNet 256x256, trained from scratch) is presented without experimental details, training protocol, baselines, error bars, or ablation studies. This claim is load-bearing for the outperformance assertion, and the missing support prevents verification of whether the derived identity alone suffices.
- [Method] Identity derivation: The identity relating average velocity to instantaneous velocity is used to guide training, but the manuscript does not demonstrate that it remains exact (or sufficiently accurate) under neural-network approximation, finite discretization of the time integral, and optimization on high-dimensional discrete data. If the identity holds only in the continuous limit, the 1-NFE sampler may not recover high-fidelity samples as claimed.
minor comments (2)
- [Abstract] The abstract would be clearer if it briefly stated the form of the training objective or the key identity equation.
- [Method] Notation for average vs. instantaneous velocity should be introduced with explicit equations early in the method section to avoid ambiguity.
Simulated Author's Rebuttal
Thank you for the constructive feedback on our manuscript. We address the major comments below, providing clarifications and indicating where revisions will be made to strengthen the paper.
read point-by-point responses
-
Referee: [Abstract] The central empirical claim of FID = 3.43 (1-NFE, ImageNet 256x256, trained from scratch) is presented without experimental details, training protocol, baselines, error bars, or ablation studies. This claim is load-bearing for the outperformance assertion, and the missing support prevents verification of whether the derived identity alone suffices.
Authors: We agree that the abstract, owing to its brevity, omits the full experimental details. However, the complete training protocol (optimizer settings, batch size, number of training steps, data augmentation, and evaluation procedure), together with comparisons to baselines and ablation studies, is thoroughly documented in Sections 4 (Experiments) and 5 (Ablations) of the manuscript. The FID score is computed with the standard protocol: 50k samples and the official Inception-v3 model. To make the abstract more informative, we will revise it to include a short description of the training setup (e.g., 'trained from scratch on ImageNet 256x256 using a standard U-Net architecture for 500k iterations'). We believe this addresses the concern while keeping the abstract concise. Error bars are omitted because single-run results are standard in the field, but we can report stability across seeds if needed. revision: yes
-
Referee: [Method] Identity derivation: The identity relating average velocity to instantaneous velocity is used to guide training, but the manuscript does not demonstrate that it remains exact (or sufficiently accurate) under neural-network approximation, finite discretization of the time integral, and optimization on high-dimensional discrete data. If the identity holds only in the continuous limit, the 1-NFE sampler may not recover high-fidelity samples as claimed.
Authors: The identity in Equation (2) is derived exactly in the continuous-time setting, without approximation. In practice, we discretize the time integral with a Riemann sum using a large number of steps during training (as detailed in the implementation), and the neural network is trained to minimize the resulting discrepancy. The empirical evidence supports this: the model achieves state-of-the-art 1-NFE performance, which would not be possible if the approximation were poor. To address the concern directly, we will add a paragraph to the Method section with a toy experiment (e.g., on 2D Gaussian mixtures) demonstrating that the learned average velocity closely matches the integrated instantaneous velocity even under neural-network approximation and coarse discretization, along with an analysis of the discretization error bound. Together these show that the identity holds sufficiently accurately for high-dimensional image data. revision: yes
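To make the proposed toy check concrete: a hypothetical JAX sketch comparing a learned average velocity against a Riemann-sum estimate of the normalized integral along the simulated trajectory (v_fn and u_fn stand for trained instantaneous- and average-velocity models; all names are illustrative, not from the paper):

```python
import jax.numpy as jnp

def riemann_average_velocity(v_fn, z_t, r, t, n_steps=256):
    """Estimate (1 / (t - r)) * integral_r^t v(z_tau, tau) d tau by
    integrating the ODE dz/dtau = v backward from (z_t, t) to time r."""
    dt = (r - t) / n_steps             # negative step: walk from t down to r
    z, tau = z_t, t
    vsum = jnp.zeros_like(z_t)
    for _ in range(n_steps):
        v = v_fn(z, tau)
        vsum = vsum + v
        z = z + dt * v                 # Euler step along the trajectory
        tau = tau + dt
    return vsum / n_steps              # average of the sampled velocities

# Discrepancy the rebuttal's experiment would report:
# err = jnp.mean(jnp.abs(u_fn(z_t, r, t) - riemann_average_velocity(v_fn, z_t, r, t)))
```

On 2D Gaussian mixtures both quantities are cheap to evaluate, so the discrepancy can be tracked across (r, t) pairs and discretization levels.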
Circularity Check
Derivation chain is self-contained with no reduction to inputs
full rationale
The paper derives an identity between average and instantaneous velocities from first principles and uses this identity to define the training objective for the MeanFlow model. This derivation is presented as independent mathematical content that guides neural network training without relying on fitted parameters from the target data, self-citations for uniqueness, or renaming of known empirical patterns. The one-step sampling performance is reported as an empirical outcome of the trained model rather than a quantity forced by construction in the derivation itself. No load-bearing step reduces the central claim to a tautology or to the inputs by definition, making the framework self-contained.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 55 Pith papers
-
AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation
AnyFlow enables any-step video diffusion by distilling flow-map transitions over arbitrary time intervals with on-policy backward simulation.
-
Generative Modeling with Flux Matching
Flux Matching generalizes score-based generative modeling by using a weaker objective that admits infinitely many non-conservative vector fields with the data as stationary distribution, enabling new design choices be...
-
Divergence is Uncertainty: A Closed-Form Posterior Covariance for Flow Matching
In flow matching, the uncertainty of the clean data given the current state is exactly the divergence of the velocity field (up to a known scalar).
-
Realtime-VLA FLASH: Speculative Inference Framework for Diffusion-based VLAs
A new speculative inference system speeds up diffusion VLAs to 19.1 ms average latency (3.04x faster) on LIBERO by replacing most full 58 ms inferences with 7.8 ms draft rounds while preserving task performance.
-
Aligning Flow Map Policies with Optimal Q-Guidance
Flow map policies enable fast one-step inference for flow-based RL policies, and FMQ provides an optimal closed-form Q-guided target for offline-to-online adaptation under trust-region constraints, achieving SOTA performance.
-
DriftXpress: Faster Drifting Models via Projected RKHS Fields
DriftXpress approximates drifting kernels via projected RKHS fields to lower training cost of one-step generative models while matching original FID scores.
-
One-Step Generative Modeling via Wasserstein Gradient Flows
W-Flow achieves state-of-the-art one-step ImageNet 256x256 generation at 1.29 FID by training a static neural network to follow a Wasserstein gradient flow that minimizes Sinkhorn divergence, delivering roughly 100x f...
-
HorizonDrive: Self-Corrective Autoregressive World Model for Long-horizon Driving Simulation
HorizonDrive enables stable long-horizon autoregressive driving simulation via anti-drifting teacher training with scheduled rollout recovery and teacher rollout distillation.
-
Physics-Informed Neural PDE Solvers via Spatio-Temporal MeanFlow
Spatio-Temporal MeanFlow adapts MeanFlow to PDEs by replacing the generative velocity field with the physical operator and extending the integral constraint to the spatio-temporal domain, yielding a unified solver for...
-
Normalizing Trajectory Models
NTM uses per-step conditional normalizing flows plus a trajectory-wide predictor to achieve exact-likelihood 4-step sampling that matches or exceeds baselines on text-to-image tasks.
-
Normalizing Trajectory Models
NTM models each generative reverse step as a conditional normalizing flow with a hybrid shallow-deep architecture, enabling exact-likelihood training and strong four-step sampling performance on text-to-image tasks.
-
Hydra-DP3: Frequency-Aware Right-Sizing of 3D Diffusion Policies for Visuomotor Control
Frequency analysis of smooth robot actions bounds denoising error to low-frequency modes, enabling a sub-1% parameter 3D diffusion policy with two-step inference that reaches SOTA on manipulation benchmarks.
-
CoFlow: Coordinated Few-Step Flow for Offline Multi-Agent Decision Making
CoFlow achieves state-of-the-art coordination quality in offline MARL using only 1-3 denoising steps by natively coupling velocity fields across agents via coordinated attention and gating.
-
CoFlow: Coordinated Few-Step Flow for Offline Multi-Agent Decision Making
CoFlow achieves state-of-the-art coordination in offline MARL using single-pass joint velocity fields with Coordinated Velocity Attention and Adaptive Coordination Gating.
-
How to Guide Your Flow: Few-Step Alignment via Flow Map Reward Guidance
FMRG is a training-free, single-trajectory guidance method for flow models derived from optimal control that achieves strong reward alignment with only 3 NFEs.
-
Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond
Proposes a levels x laws taxonomy for world models in AI agents, defining L1-L3 capabilities across physical, digital, social, and scientific regimes while reviewing over 400 works to outline a roadmap for advanced ag...
-
Guiding Distribution Matching Distillation with Gradient-Based Reinforcement Learning
GDMD replaces raw-sample rewards with distillation-gradient rewards in RL-guided diffusion distillation, yielding 4-step models that surpass their multi-step teachers on GenEval and human preference metrics.
-
Self-Improving Tabular Language Models via Iterative Group Alignment
TabGRAA enables self-improving tabular language models through iterative group-relative advantage alignment using modular automated quality signals like distinguishability classifiers.
-
Efficient Video Diffusion Models: Advancements and Challenges
A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.
-
LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories
LeapAlign fine-tunes flow matching models by constructing two consecutive leaps that skip multiple ODE steps with randomized timesteps and consistency weighting, enabling stable updates at any generation step.
-
LayerCache: Exploiting Layer-wise Velocity Heterogeneity for Efficient Flow Matching Inference
LayerCache enables per-layer-group caching in flow matching models via adaptive JVP span selection and greedy 3D scheduling, delivering 1.37x speedup with PSNR 37.46 dB, SSIM 0.9834, and LPIPS 0.0178 on Qwen-Image.
-
Isokinetic Flow Matching for Pathwise Straightening of Generative Flows
Isokinetic Flow Matching adds a lightweight regularization term to flow matching that penalizes acceleration along paths via self-guided finite differences, yielding straighter trajectories and large gains in few-step...
-
1.x-Distill: Breaking the Diversity, Quality, and Efficiency Barrier in Distribution Matching Distillation
1.x-Distill achieves better quality and diversity than prior few-step distillation methods at 1.67 and 1.74 effective NFEs on SD3 models with up to 33x speedup.
-
VOSR: A Vision-Only Generative Model for Image Super-Resolution
VOSR shows that competitive generative image super-resolution with faithful structures can be achieved by training a diffusion-style model from scratch on visual data alone, using a vision encoder for guidance and a r...
-
Setting-Matched and Semantics-Scaled Benchmarking of One-Step Generative Models Against Multistep Diffusion and Flow Models
Matched benchmarking reveals FID misleads in few-step regimes under CFG, prompting CLIP-scaled and PickScore-scaled FID and IS variants for better semantic evaluation of one-step image generators.
-
Training Agents Inside of Scalable World Models
Dreamer 4 is the first agent to obtain diamonds in Minecraft from only offline data by reinforcement learning inside a scalable world model that accurately predicts game mechanics.
-
Gradient-Free Noise Optimization for Reward Alignment in Generative Models
ZeNO frames noise optimization as a path-integral control problem solvable from zeroth-order reward evaluations, connecting to implicit Langevin dynamics for reward-tilted distributions.
-
Gradient-Free Noise Optimization for Reward Alignment in Generative Models
ZeNO formulates noise optimization for reward alignment as a path-integral control problem solvable via zeroth-order reward evaluations alone, connecting to Langevin dynamics under an Ornstein-Uhlenbeck process.
-
Noise-Started One-Step Real-World Super-Resolution via LR-Conditioned SplitMeanFlow and GAN Refinement
SMFSR achieves state-of-the-art perceptual quality among one-step diffusion-based real-world super-resolution methods by preserving noise-started generation via LR-conditioned SplitMeanFlow and GAN refinement.
-
dFlowGRPO: Rate-Aware Policy Optimization for Discrete Flow Models
dFlowGRPO is a new rate-aware RL method for discrete flow models that outperforms prior GRPO approaches on image generation and matches continuous flow models while supporting broad probability paths.
-
Slowly Annealed Langevin Dynamics: Theory and Applications to Training-Free Guided Generation
Slowly Annealed Langevin Dynamics provides non-asymptotic KL-based convergence guarantees for tracking moving targets and enables training-free guided generation via a velocity-aware correction that accounts for pretr...
-
FlashMol: High-Quality Molecule Generation in as Few as Four Steps
FlashMol produces chemically valid 3D molecules in 4 steps via distribution matching distillation with respaced timesteps and Jensen-Shannon regularization, matching or exceeding 1000-step teacher performance on QM9 a...
-
Tyche: One Step Flow for Efficient Probabilistic Weather Forecasting
Tyche achieves competitive probabilistic weather forecasting skill and calibration using a single-step flow model with JVP-regularized training and rollout finetuning.
-
Velox: Learning Representations of 4D Geometry and Appearance
Velox compresses dynamic point clouds into latent tokens that support geometry via 4D surface modeling and appearance via 3D Gaussians, showing strong results on video-to-4D generation, tracking, and image-to-4D cloth...
-
A Few-Step Generative Model on Cumulative Flow Maps
Cumulative flow maps unify few-step generative modeling for diffusion and flow models via cumulative transport and parameterization with minimal changes to time embeddings and objectives.
-
Hydra-DP3: Frequency-Aware Right-Sizing of 3D Diffusion Policies for Visuomotor Control
Hydra-DP3 is a lightweight 3D diffusion policy that uses frequency analysis of smooth action trajectories to enable two-step DDIM inference and achieves state-of-the-art results with under 1% of prior parameters.
-
Hydra-DP3: Frequency-Aware Right-Sizing of 3D Diffusion Policies for Visuomotor Control
Hydra-DP3 achieves SOTA visuomotor performance with under 1% of prior 3D diffusion policy parameters by using frequency analysis to justify a lightweight decoder and two-step DDIM inference.
-
CoFlow: Coordinated Few-Step Flow for Offline Multi-Agent Decision Making
CoFlow preserves inter-agent coordination in few-step offline MARL by using a natively joint velocity field with Coordinated Velocity Attention and Adaptive Coordination Gating, matching or exceeding baselines in 1-3 ...
-
On the Role of Strain and Vorticity in Numerical Integration Error for Flow Matching
Strain in the velocity Jacobian exponentially amplifies integration errors in flow matching while vorticity contributes linearly; strain-weighted Jacobian regularization reduces error up to 2.7x at NFE=5 and improves ...
-
Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation
By requiring and using highly discriminative LLM text features, the work enables the first effective one-step text-conditioned image generation with MeanFlow.
-
Fisher Decorator: Refining Flow Policy via a Local Transport Map
Fisher Decorator refines flow policies in offline RL via a local transport map and Fisher-matrix quadratic approximation of the KL constraint, yielding controllable error near the optimum and SOTA benchmark results.
-
Mean Flow Policy Optimization
Mean Flow Policy Optimization (MFPO) uses few-step flow-based models for RL policies and achieves performance on par with or better than diffusion-based methods while substantially lowering training and inference time...
-
Self-Adversarial One Step Generation via Condition Shifting
APEX derives self-adversarial gradients from condition-shifted velocity fields in flow models to achieve high-fidelity one-step generation, outperforming much larger models and multi-step teachers.
-
MENO: MeanFlow-Enhanced Neural Operators for Dynamical Systems
MENO enhances neural operators with MeanFlow to restore multi-scale accuracy in dynamical system predictions while keeping inference costs low, achieving up to 2x better power spectrum accuracy and 12x faster inferenc...
-
Salt: Self-Consistent Distribution Matching with Cache-Aware Training for Fast Video Generation
Salt improves low-step video generation quality by adding endpoint-consistent regularization to distribution matching distillation and using cache-conditioned feature alignment for autoregressive models.
-
PixelFlowCast: Latent-Free Precipitation Nowcasting via Pixel Mean Flows
PixelFlowCast delivers high-fidelity precipitation nowcasts from radar sequences using a latent-free Pixel Mean Flows predictor guided by a deterministic coarse stage and KANCondNet features.
-
Deterministic Decomposition of Stochastic Generative Dynamics
Stochastic generative dynamics admit a transport-osmotic decomposition of the deterministic field, supporting Bridge Matching for interpretable and tunable generation.
-
Consistency Regularised Gradient Flows for Inverse Problems
A consistency-regularized Euclidean-Wasserstein-2 gradient flow performs joint posterior sampling and prompt optimization in latent space for efficient low-NFE inverse problem solving with diffusion models.
-
Teacher-Feature Drifting: One-Step Diffusion Distillation with Pretrained Diffusion Representations
A simplified one-step diffusion distillation uses pretrained teacher features directly for drifting loss plus a mode coverage term, achieving FID 1.58 on ImageNet-64 and 18.4 on SDXL.
-
Fast Text-to-Audio Generation with One-Step Sampling via Energy-Scoring and Auxiliary Contextual Representation Distillation
A one-step text-to-audio model using energy-distance training and contextual distillation outperforms prior fast baselines on AudioCaps and achieves up to 8.5x faster inference than the multi-step IMPACT system with c...
-
Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling
Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemph...
-
RA-CMF: Region-Adaptive Conditional MeanFlow for CT Image Reconstruction
RA-CMF integrates conditional MeanFlow for trajectory-based image enhancement with an RL-driven policy for tile-wise adaptive refinement budgets, achieving average PSNR of 34.23 and SSIM of 0.95 on CT images with stro...
-
Adversarial Flow Matching for Imperceptible Attacks on End-to-End Autonomous Driving
AFM is a novel gray-box adversarial attack using flow matching to create visually imperceptible perturbations that degrade performance of Vision-Language-Action and modular end-to-end autonomous driving models while s...
-
Efficient Hierarchical Implicit Flow Q-learning for Offline Goal-conditioned Reinforcement Learning
Proposes mean flow policies and LeJEPA loss to overcome Gaussian policy limits and weak subgoal generation in hierarchical offline GCRL, reporting strong results on OGBench state and pixel tasks.
-
Qwen-Image-2.0 Technical Report
Qwen-Image-2.0 unifies high-fidelity image generation and precise editing by coupling Qwen3-VL with a Multimodal Diffusion Transformer, improving text rendering, photorealism, and complex prompt following over prior versions.
Reference graph
Works this paper leans on
-
[1]
Building Normalizing Flows with Stochastic Interpolants
Michael S Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants. arXiv preprint arXiv:2209.15571, 2022. 2
work page internal anchor Pith review arXiv 2022
-
[2]
Building normalizing flows with stochastic interpolants
Michael Samuel Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants. In International Conference on Learning Representations (ICLR), 2023. 1, 2
work page 2023
-
[3]
Flow map matching
Nicholas M Boffi, Michael S Albergo, and Eric Vanden-Eijnden. Flow map matching. arXiv preprint arXiv:2406.07507, 2024. 2
-
[4]
JAX: composable transformations of Python+NumPy programs, 2018
James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax. 16
work page 2018
-
[5]
Large scale GAN training for high fidelity natural image synthesis
Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations (ICLR), 2019.
-
[6]
Maskgit: Masked generative image transformer
Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. Maskgit: Masked generative image transformer. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022. 9
work page 2022
-
[7]
Imagenet: A large-scale hierarchical image database
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009. 2, 7
work page 2009
-
[8]
Diffusion models beat gans on image synthesis
Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Neural Information Processing Systems (NeurIPS), 34, 2021. 9
work page 2021
-
[9]
An image is worth 16x16 words: Transformers for image recognition at scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR), 2021. 7, 14
work page 2021
-
[10]
Taming transformers for high-resolution image synthesis
Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
-
[11]
Scaling rectified flow transformers for high-resolution image synthesis
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In International Conference on Machine Learning (ICML), 2024. 1, 7, 8
work page 2024
-
[12]
An introduction to flow matching
Tor Fjelde, Emile Mathieu, and Vincent Dutordoir. An introduction to flow matching. https://mlg.eng.cam.ac.uk/blog/2024/01/20/flow-matching.html, January 2024. Cambridge Machine Learning Group Blog. 3
work page 2024
-
[13]
One step diffusion via shortcut models
Kevin Frans, Danijar Hafner, Sergey Levine, and Pieter Abbeel. One step diffusion via shortcut models. In International Conference on Learning Representations (ICLR), 2025. 1, 2, 5, 6, 7, 8, 9
work page 2025
-
[14]
One-step diffusion distillation via deep equilibrium models
Zhengyang Geng, Ashwini Pokle, and J Zico Kolter. One-step diffusion distillation via deep equilibrium models. Neural Information Processing Systems (NeurIPS), 36, 2024. 2
work page 2024
-
[15]
Consistency models made easy
Zhengyang Geng, Ashwini Pokle, William Luo, Justin Lin, and J Zico Kolter. Consistency models made easy. arXiv preprint arXiv:2406.14548, 2024. 1, 2, 5, 6, 7, 8, 9, 15
-
[16]
Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017. 14
work page internal anchor Pith review arXiv 2017
-
[17]
Gans trained by a two time-scale update rule converge to a local nash equilibrium
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Neural Information Processing Systems (NeurIPS), 2017. 7
work page 2017
-
[18]
Classifier-Free Diffusion Guidance
Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022. 2, 6, 14, 15
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[19]
Denoising diffusion probabilistic models
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Neural Information Processing Systems (NeurIPS), 2020. 1, 2
work page 2020
-
[20]
simple diffusion: End-to-end diffusion for high resolution images
Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. simple diffusion: End-to-end diffusion for high resolution images. In International Conference on Machine Learning (ICML), 2023. 9
work page 2023
-
[21]
Scaling up gans for text-to-image synthesis
Minguk Kang, Jun-Yan Zhu, Richard Zhang, Jaesik Park, Eli Shechtman, Sylvain Paris, and Taesung Park. Scaling up gans for text-to-image synthesis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023. 9
work page 2023
-
[22]
Elucidating the design space of diffusion-based generative models
Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In Neural Information Processing Systems (NeurIPS), 2022. 2, 9, 14
work page 2022
-
[23]
Consistency trajectory models: Learning probability flow ODE trajectory of diffusion
Dongjun Kim, Chieh-Hsin Lai, Wei-Hsiang Liao, Naoki Murata, Yuhta Takida, Toshimitsu Uesaka, Yutong He, Yuki Mitsufuji, and Stefano Ermon. Consistency trajectory models: Learning probability flow ODE trajectory of diffusion. In International Conference on Learning Representations (ICLR), 2024. 5
work page 2024
-
[24]
Adam: A method for stochastic optimization
Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015. 14
work page 2015
-
[25]
Learning multiple layers of features from tiny images
Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009. URL https://www.cs.toronto.edu/~kriz/cifar.html. 9
work page 2009
-
[26]
Epistola LXXI ad johannem bernoullium, 5 aug 1697
Gottfried Wilhelm Leibniz. Epistola LXXI ad johannem bernoullium, 5 aug 1697. In Johann Bernoulli, editor, Virorum celeberrimorum G. G. Leibnitii et Johannis Bernoullii Commercium philosophicum et mathematicum, volume I, pages 368–370. Marc-Michel Bousquet, Lausanne & Geneva, 1745. URL https://archive.org/details/bub_gb_lO3wOMxjoF8C/page/368. First explic...
-
[27]
Autoregressive image generation without vector quantization
Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive image generation without vector quantization. Neural Information Processing Systems (NeurIPS), 2024.
-
[28]
Flow matching for generative modeling
Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In International Conference on Learning Representations (ICLR), 2023. 1, 2, 3, 5, 6
work page 2023
-
[29]
Flow matching guide and code
Yaron Lipman, Marton Havasi, Peter Holderrieth, Neta Shaul, Matt Le, Brian Karrer, Ricky T. Q. Chen, David Lopez-Paz, Heli Ben-Hamu, and Itai Gat. Flow matching guide and code, 2024. URL https://arxiv.org/abs/2412.06264. 14
work page internal anchor Pith review arXiv
-
[31]
Flow straight and fast: Learning to generate and transfer data with rectified flow
Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. In International Conference on Learning Representations (ICLR), 2023. 1, 2, 3
work page 2023
-
[32]
Simplifying, stabilizing and scaling continuous-time consistency models
Cheng Lu and Yang Song. Simplifying, stabilizing and scaling continuous-time consistency models. In International Conference on Learning Representations (ICLR), 2025. 1, 2, 5, 6, 9
work page 2025
-
[33]
Diff-Instruct: A universal approach for transferring knowledge from pre-trained diffusion models
Weijian Luo, Tianyang Hu, Shifeng Zhang, Jiacheng Sun, Zhenguo Li, and Zhihua Zhang. Diff-Instruct: A universal approach for transferring knowledge from pre-trained diffusion models. Neural Information Processing Systems (NeurIPS), 2024. 2
work page 2024
-
[34]
Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers
Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In European Conference on Computer Vision (ECCV), 2024. 1, 7, 8, 9
work page 2024
-
[35]
Scalable diffusion models with transformers
William Peebles and Saining Xie. Scalable diffusion models with transformers. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023. 7, 8, 9, 14
work page 2023
-
[36]
Movie Gen: A cast of media foundation models, 2025
Adam Polyak et al. Movie Gen: A cast of media foundation models, 2025. 1
work page 2025
-
[37]
Variational inference with normalizing flows
Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning (ICML), 2015. 2
work page 2015
-
[38]
High-resolution image synthesis with latent diffusion models
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021. 7, 9
work page 2021
-
[39]
U-net: Convolutional networks for biomedical image segmentation
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2015. 9, 14
work page 2015
-
[40]
Progressive distillation for fast sampling of diffusion models
Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. In International Conference on Learning Representations (ICLR), 2022. 2
work page 2022
-
[41]
Stylegan-xl: Scaling stylegan to large diverse datasets
Axel Sauer, Katja Schwarz, and Andreas Geiger. Stylegan-xl: Scaling stylegan to large diverse datasets. In ACM Transactions on Graphics (SIGGRAPH), 2022. 9
work page 2022
-
[42]
Adversarial diffusion distillation
Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. Adversarial diffusion distillation. In European Conference on Computer Vision (ECCV), 2024. 2
work page 2024
-
[43]
Deep unsupervised learning using nonequilibrium thermodynamics
Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning (ICML), 2015. 1, 2
work page 2015
-
[44]
Improved techniques for training consistency models
Yang Song and Prafulla Dhariwal. Improved techniques for training consistency models. In International Conference on Learning Representations (ICLR), 2024. 1, 2, 5, 6, 7, 8, 9, 15
work page 2024
-
[45]
Generative modeling by estimating gradients of the data distribution
Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Neural Information Processing Systems (NeurIPS), 2019. 1, 2, 9, 14
work page 2019
-
[46]
Score-based generative modeling through stochastic differential equations
Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations (ICLR), 2021. 2
work page 2021
-
[47]
Consistency models
Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. In International Conference on Machine Learning (ICML), 2023. 1, 2, 3, 5, 6, 7
work page 2023
-
[48]
Visual autoregressive modeling: Scalable image generation via next-scale prediction
Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction. Neural Information Processing Systems (NeurIPS), 2024. 9
work page 2024
-
[49]
Attention is all you need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. Neural Information Processing Systems (NeurIPS), 2017. 7
work page 2017
-
[50]
Consistency flow matching: Defining straight flows with velocity consistency
Ling Yang, Zixiang Zhang, Zhilong Zhang, Xingchao Liu, Minkai Xu, Wentao Zhang, Chenlin Meng, Stefano Ermon, and Bin Cui. Consistency flow matching: Defining straight flows with velocity consistency. arXiv preprint arXiv:2407.02398, 2024. 2, 5
-
[51]
One-step diffusion with distribution matching distillation
Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Frédo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 2
work page 2024
-
[52]
Representation alignment for generation: Training diffusion transformers is easier than you think
Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. In International Conference on Learning Representations (ICLR), 2025. 9
work page 2025
-
[53]
Inductive moment matching
Linqi Zhou, Stefano Ermon, and Jiaming Song. Inductive moment matching. arXiv preprint arXiv:2503.07565, 2025. 1, 2, 5, 6, 7, 8, 9
-
[54]
Score identity distillation: Exponentially fast distillation of pretrained diffusion models for one-step generation
Mingyuan Zhou, Huangjie Zheng, Zhendong Wang, Mingzhang Yin, and Hai Huang. Score identity distillation: Exponentially fast distillation of pretrained diffusion models for one-step generation. In International Conference on Machine Learning (ICML), 2024. 2
work page 2024