Meta Flow Maps enable scalable reward alignment

Abbas Mammadov; Adhi Saravanan; Alvaro Prat; Michael S. Albergo; Peter Potaptchik; Yee Whye Teh

arxiv: 2601.14430 · v2 · pith:RIMXZLO7new · submitted 2026-01-20 · 📊 stat.ML · cs.LG

keywords: flow · maps · alignment · clean · data · function · inference-time · intermediate

Controlling generative models is computationally expensive. This is because optimal alignment with a reward function--whether via inference-time steering or fine-tuning--requires estimating the value function. This task demands access to the conditional posterior $p_{1|t}(x_1|x_t)$, the distribution of clean data $x_1$ consistent with an intermediate state $x_t$, a requirement that typically compels methods to resort to costly trajectory simulations. To address this bottleneck, we introduce Meta Flow Maps (MFMs), a framework extending consistency models and flow maps into the stochastic regime. MFMs are trained to perform stochastic one-step posterior sampling, generating arbitrarily many i.i.d. draws of clean data $x_1$ from any intermediate state. Crucially, these samples provide a differentiable reparametrization that unlocks efficient value function estimation. We leverage this capability to solve bottlenecks in both paradigms: enabling inference-time steering without inner rollouts, and facilitating unbiased, off-policy fine-tuning to general rewards. Empirically, our single-particle steered-MFM sampler outperforms a Best-of-1000 baseline on ImageNet across multiple rewards at a fraction of the compute.
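
To make the role of the conditional posterior concrete, here is a minimal sketch of value function estimation from one-step posterior draws. It is not the paper's implementation. It assumes a one-dimensional toy problem with a linear interpolant $x_t = t\,x_1 + (1-t)\,x_0$ and standard normal $x_0$ and $x_1$, so that $p_{1|t}(x_1|x_t)$ is Gaussian in closed form, and it assumes the common exponential-tilt value function $V_t(x_t) = \log \mathbb{E}[\exp r(x_1) \mid x_t]$. The abstract does not state either of these. The names `posterior_sampler`, `reward`, `value_and_grad`, and `closed_form` are hypothetical; `posterior_sampler` stands in for a trained MFM.

```python
import numpy as np


def posterior_sampler(x_t, t, eps):
    """Toy stand-in for a one-step posterior sampler (assumption, not an MFM).

    For x_t = t*x_1 + (1-t)*x_0 with x_0, x_1 ~ N(0, 1), the posterior
    p(x_1 | x_t) is Gaussian, so x_1 is a differentiable function of noise eps.
    """
    var_t = t**2 + (1 - t) ** 2
    mean = t * x_t / var_t
    std = (1 - t) / np.sqrt(var_t)
    return mean + std * eps


def reward(x1, target=1.0):
    # Hypothetical quadratic reward, chosen so a closed form exists.
    return -0.5 * (x1 - target) ** 2


def reward_grad(x1, target=1.0):
    return -(x1 - target)


def value_and_grad(x_t, t, n_samples=200_000, seed=0):
    """Monte Carlo estimate of V = log E[exp r(x_1) | x_t] and dV/dx_t."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n_samples)
    x1 = posterior_sampler(x_t, t, eps)  # i.i.d. posterior draws, no rollout
    r = reward(x1)
    r_max = r.max()
    w = np.exp(r - r_max)  # stabilized exponential weights
    value = r_max + np.log(w.mean())
    # Reparametrization gradient: chain rule through the sampler,
    # where d x_1 / d x_t = t / var_t for this toy sampler.
    dx1_dxt = t / (t**2 + (1 - t) ** 2)
    grad = np.sum(w * reward_grad(x1) * dx1_dxt) / np.sum(w)
    return value, grad


def closed_form(x_t, t, target=1.0):
    """Exact value and gradient for the Gaussian toy problem, for comparison."""
    var_t = t**2 + (1 - t) ** 2
    c = t / var_t
    s2 = (1 - t) ** 2 / var_t
    value = -0.5 * np.log1p(s2) - 0.5 * (c * x_t - target) ** 2 / (1 + s2)
    grad = -c * (c * x_t - target) / (1 + s2)
    return value, grad


if __name__ == "__main__":
    print("Monte Carlo:", value_and_grad(0.3, 0.5))
    print("Closed form:", closed_form(0.3, 0.5))
```

In this sketch the estimate needs only independent noise draws pushed through the sampler once, which is the property the abstract attributes to MFMs; a trained model would replace the closed-form Gaussian map, and the chain-rule factor would come from automatic differentiation.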

Forward citations

Cited by 11 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Strong Stochastic Flow Maps
cs.LG 2026-05 unverdicted novelty 8.0

Strong Stochastic Flow Maps learn the strong solution map of additive-noise SDEs via a pathwise-convergent polynomial Brownian approximation, generalizing deterministic flow maps and enabling simulation-free training ...
Aligning Flow Map Policies with Optimal Q-Guidance
cs.LG 2026-05 unverdicted novelty 7.0

Flow map policies enable fast one-step inference for flow-based RL policies, and FMQ provides an optimal closed-form Q-guided target for offline-to-online adaptation under trust-region constraints, achieving SOTA performance.
Reinforce Adjoint Matching: Scaling RL Post-Training of Diffusion and Flow-Matching Models
cs.LG 2026-05 unverdicted novelty 7.0

Reinforce Adjoint Matching derives a simple consistency loss for RL post-training of diffusion models by tilting the clean distribution toward higher-reward samples under KL regularization while keeping the noising pr...
ABC: Any-Subset Autoregression via Non-Markovian Diffusion Bridges in Continuous Time and Space
cs.LG 2026-04 unverdicted novelty 7.0

ABC enables any-subset autoregressive generation of continuous stochastic processes via non-Markovian diffusion bridges that track physical time and allow path-dependent conditioning.
How to Guide Your Flow: Few-Step Alignment via Flow Map Reward Guidance
cs.LG 2026-04 unverdicted novelty 7.0

FMRG is a training-free, single-trajectory guidance method for flow models derived from optimal control that achieves strong reward alignment with only 3 NFEs.
Few-step Cofolding with All-Atom Flow Maps
cs.LG 2026-06 unverdicted novelty 6.0

DeCAF distills all-atom cofolding diffusion models into few-step flow maps, showing improved or matched accuracy on protein-ligand tasks with 5x fewer inference steps.
Fast Organic Crystal Structure Prediction with Unit Cell Flow Matching
cs.LG 2026-06 unverdicted novelty 6.0

Clari, a unit-cell flow matching model with pair-bias attention, generates organic crystal structures faster than OXtal while improving solve rates and supporting energy-based ranking without relaxation.
Are we really tilting? The mechanics of reward guidance in flow and diffusion models
cs.LG 2026-06 unverdicted novelty 6.0

Finite-particle approximation of the Doob h-function causes reward hacking via two failure modes in reward-guided diffusion; a damping schedule corrects within-mode bias in Gaussian settings.
Reinforce Adjoint Matching: Scaling RL Post-Training of Diffusion and Flow-Matching Models
cs.LG 2026-05 unverdicted novelty 6.0

Derives RAM, a reward-adjusted consistency loss extending diffusion pretraining regression to efficient KL-regularized RL post-training, achieving peak rewards up to 50x faster than Flow-GRPO on Stable Diffusion 3.5M.
How to Guide Your Flow: Few-Step Alignment via Flow Map Reward Guidance
cs.LG 2026-04 unverdicted novelty 6.0

FMRG is a training-free single-trajectory guidance framework for flow-based models that matches or exceeds baselines on reward-guided tasks and inverse problems using as few as 3 NFEs.
Measure-to-measure Regression with Transformers
cs.LG 2026-05 unverdicted novelty 5.0

Formalizes nonlinear M2M regression and introduces transformer architectures as static maps and dynamic velocity fields between probability measures, tested on synthetic, particle, and organoid datasets.