Efficient Adjoint Matching for Fine-tuning Diffusion Models
Recognition: 2 theorem links (Lean)
Pith reviewed 2026-05-13 01:48 UTC · model grok-4.3
The pith
By reformulating the stochastic optimal control problem with a linear base drift, Efficient Adjoint Matching speeds up reward fine-tuning of diffusion models by up to 4x while matching performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EAM reformulates the stochastic optimal control problem for reward fine-tuning by replacing the pretrained model's non-trivial base drift with a linear base drift and introducing a correspondingly modified terminal cost. The new formulation supports training-time sampling with a few-step deterministic ODE solver and supplies a closed-form adjoint solution that removes the need for backward simulation along each trajectory.
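To make the claimed structure concrete, here is a minimal sketch of what such a training step could look like (ours, not the authors' code). The model `v_theta`, the reward `reward`, the Euler rollout, and the regression target are illustrative stand-ins; only the shape of the step (deterministic few-step ODE at training time, a closed-form adjoint in place of backward simulation, with the coefficient taken from the paper's Eq. (46)) follows the claim.

```python
# Minimal sketch (not the authors' code) of an EAM-style step: a few-step
# deterministic ODE rollout plus a closed-form adjoint. `v_theta`, `reward`,
# the Euler solver, and the regression target are illustrative stand-ins.
import torch

def phi(t: torch.Tensor, C: float = 1.0) -> torch.Tensor:
    # Adjoint coefficient reported in the paper's appendix (Eq. (46)):
    # a(t; X_t) = phi(t) * a(1; X_1), phi(t) = (2C-1) t / (2C t^2 - 2t + 1).
    return (2 * C - 1) * t / (2 * C * t**2 - 2 * t + 1)

def eam_step_sketch(v_theta, reward, x0, n_steps=4):
    dt = 1.0 / n_steps
    x, traj = x0, []
    for k in range(n_steps):  # few-step deterministic ODE (no noise injected)
        t = torch.full((x.shape[0],), k * dt)
        traj.append((t, x))
        x = x + dt * v_theta(t, x)
    x1 = x.detach().requires_grad_(True)
    (grad_g,) = torch.autograd.grad(reward(x1).sum(), x1)  # a(1; X_1) = grad g
    loss = 0.0
    for t, xt in traj:
        adjoint = phi(t)[:, None] * grad_g  # closed form: no backward adjoint ODE
        # Schematic adjoint-matching regression of the control against the
        # adjoint; the true target also involves the base drift and noise scale.
        loss = loss + ((v_theta(t, xt.detach()) + adjoint) ** 2).mean()
    return loss / n_steps

# Toy usage: a linear "velocity model" on 2-D points and a quadratic "reward".
lin = torch.nn.Linear(3, 2)
v_theta = lambda t, x: lin(torch.cat([x, t[:, None]], dim=1))
reward = lambda x: -(x ** 2).sum(dim=1)
eam_step_sketch(v_theta, reward, torch.randn(8, 2)).backward()
```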
What carries the argument
Reformulation of the SOC problem to a linear base drift with a modified terminal cost, which removes the need for full stochastic trajectory sampling and for backward adjoint ODE integration.
Load-bearing premise
That replacing the pretrained model's non-trivial drift with a linear base drift plus a modified terminal cost still solves the original alignment objective without adding bias or losing solution quality.
What would settle it
A side-by-side run on the same text-to-image reward benchmarks where EAM reaches lower final scores on PickScore, ImageReward, or HPSv2.1 than standard Adjoint Matching after the same number of steps.
Original abstract
Reward fine-tuning has become a common approach for aligning pretrained diffusion and flow models with human preferences in text-to-image generation. Among reward-gradient-based methods, Adjoint Matching (AM) provides a principled formulation by casting reward fine-tuning as a stochastic optimal control (SOC) problem. However, AM inevitably incurs a substantial computational cost: it requires (i) stochastic simulation of full generative trajectories under memoryless dynamics, resulting in a large number of function evaluations, and (ii) backward ODE simulation of the adjoint state along each sampled trajectory. In this work, we observe that both bottlenecks are closely tied to the non-trivial base drift inherited from the pretrained model. Motivated by this observation, we propose Efficient Adjoint Matching (EAM), which substantially improves training efficiency by reformulating the SOC problem with a linear base drift and a correspondingly modified terminal cost. This reformulation removes both sources of inefficiency; it enables training-time sampling with a few-step deterministic ODE solver and yields a closed-form adjoint solution that eliminates backward adjoint simulation. On standard text-to-image reward fine-tuning benchmarks, EAM converges up to 4x faster than AM and matches or surpasses it across various metrics including PickScore, ImageReward, HPSv2.1, CLIPScore and Aesthetics.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Efficient Adjoint Matching (EAM) for reward fine-tuning of pretrained diffusion and flow models. It reformulates the stochastic optimal control (SOC) problem underlying Adjoint Matching (AM) by replacing the pretrained model's non-linear base drift with a linear one and adjusting the terminal cost accordingly. This enables few-step deterministic ODE sampling at training time and a closed-form adjoint solution, removing the need for stochastic trajectory simulation and backward adjoint ODE integration. On standard text-to-image benchmarks, EAM is reported to converge up to 4x faster than AM while matching or exceeding it on PickScore, ImageReward, HPSv2.1, CLIPScore, and Aesthetics.
Significance. If the reformulation is mathematically equivalent to the original AM objective, the work would offer a practical advance in computational efficiency for preference alignment of generative models, addressing the high cost of full-trajectory stochastic simulation and adjoint integration that currently limits adjoint-based fine-tuning methods.
major comments (2)
- [§3] Reformulation of the SOC problem: The central claim that switching to a linear base drift plus a modified terminal cost preserves the original alignment objective and solution quality is not supported by a derivation showing exact compensation for the drift change under the pretrained noise schedule and score function. Without this, the observed metric parity could arise from optimization of a surrogate rather than faithful recovery of AM's optimum.
- [§4.2, Table 2] Experimental results: The ablation on sampling steps and convergence speed does not include a direct comparison (e.g., KL divergence or a reward-distribution distance) between the EAM-learned policy and the original AM solution; this measurement is needed to confirm that the 4x speedup does not come at the cost of a different alignment distribution.
minor comments (2)
- [§3.1] Notation for the linear drift term is introduced without explicit connection to standard linear SDE forms (e.g., Ornstein-Uhlenbeck), which may reduce accessibility for readers familiar with diffusion literature.
- [Figure 4] The convergence curves in Figure 4 would be clearer with shaded variance bands from multiple random seeds to substantiate the robustness of the reported speedups.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. Below we respond point-by-point to the major comments, clarifying the mathematical basis of the reformulation and addressing the experimental validation. We indicate where revisions will be incorporated.
Point-by-point responses
Referee: [§3] Reformulation of the SOC problem: The central claim that switching to a linear base drift plus a modified terminal cost preserves the original alignment objective and solution quality is not supported by a derivation showing exact compensation for the drift change under the pretrained noise schedule and score function. Without this, the observed metric parity could arise from optimization of a surrogate rather than faithful recovery of AM's optimum.
Authors: We thank the referee for this important point. In §3 the reformulation is obtained by substituting the linear base drift into the SOC objective and deriving the compensating adjustment to the terminal cost so that the resulting value function (and therefore the optimal policy) remains identical to that of the original AM problem. The compensation is exact because the difference in the controlled probability paths induced by the drift change is fully absorbed into the modified terminal cost under the fixed pretrained noise schedule; the score function of the pretrained model enters only through the original dynamics and is unchanged. To eliminate any ambiguity, the revised §3 will include an explicit step-by-step derivation showing the cancellation of the drift-induced terms, thereby confirming that EAM recovers the same alignment objective. revision: yes
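In generic SOC notation (ours, not the manuscript's exact statement), the equivalence invoked here can be sketched as follows, where $b$ is the pretrained base drift, $D(t)\,Y_t$ its linear replacement, and $\tilde g$ the compensated terminal cost; the claim is that $\tilde g$ can be chosen so that both problems share the same value function, and hence the same optimal control.

```latex
% Schematic only; symbols b, D(t), g, \tilde g, \sigma_t as in the lead-in.
\begin{aligned}
\text{(AM)}\;  & \min_{u}\ \mathbb{E}\Big[\int_0^1 \tfrac{1}{2}\lVert u_t\rVert^2\,\mathrm{d}t + g(X_1)\Big],
 & \mathrm{d}X_t &= \big(b(t, X_t) + \sigma_t u_t\big)\,\mathrm{d}t + \sigma_t\,\mathrm{d}W_t, \\
\text{(EAM)}\; & \min_{v}\ \mathbb{E}\Big[\int_0^1 \tfrac{1}{2}\lVert v_t\rVert^2\,\mathrm{d}t + \tilde{g}(Y_1)\Big],
 & \mathrm{d}Y_t &= \big(D(t)\,Y_t + \sigma_t v_t\big)\,\mathrm{d}t + \sigma_t\,\mathrm{d}W_t.
\end{aligned}
```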
Referee: [§4.2, Table 2] Experimental results: The ablation on sampling steps and convergence speed does not include a direct comparison (e.g., KL divergence or a reward-distribution distance) between the EAM-learned policy and the original AM solution; this measurement is needed to confirm that the 4x speedup does not come at the cost of a different alignment distribution.
Authors: We agree that an explicit distributional comparison would be desirable. Direct KL divergence between the two policies is computationally intractable in the high-dimensional image space. Nevertheless, the reported metrics already demonstrate that EAM achieves parity with, or better performance than, AM on the same reward functions that define the alignment objective. In the revised manuscript we will add, in the appendix, a side-by-side comparison of reward histograms and a few additional proxy statistics (e.g., mean and variance of the reward under both policies) on the same set of prompts to further support distributional similarity. revision: partial
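A minimal sketch of the proposed proxy comparison, assuming per-image reward scores from both policies on a shared prompt set (the arrays, sample sizes, and bin count below are placeholders):

```python
# Sketch of the proposed appendix comparison: summary statistics plus a
# total-variation distance between binned reward distributions. All inputs
# here are placeholders, not results from the paper.
import numpy as np

def reward_summary(r: np.ndarray) -> dict:
    return {"mean": r.mean(), "std": r.std(ddof=1), "n": r.size}

def histogram_tv(a: np.ndarray, b: np.ndarray, bins: int = 50) -> float:
    lo, hi = min(a.min(), b.min()), max(a.max(), b.max())
    pa, _ = np.histogram(a, bins=bins, range=(lo, hi))
    pb, _ = np.histogram(b, bins=bins, range=(lo, hi))
    pa, pb = pa / pa.sum(), pb / pb.sum()
    return 0.5 * float(np.abs(pa - pb).sum())  # 0 = identical, 1 = disjoint

rng = np.random.default_rng(0)
rewards_am = rng.normal(0.21, 0.05, size=2000)   # stand-in for AM scores
rewards_eam = rng.normal(0.21, 0.05, size=2000)  # stand-in for EAM scores
print(reward_summary(rewards_am), reward_summary(rewards_eam))
print("TV distance:", histogram_tv(rewards_am, rewards_eam))
```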
Circularity Check
No significant circularity; reformulation is an explicit modeling choice with independent empirical validation
full rationale
The paper's central derivation consists of an explicit reformulation of the SOC objective that replaces the pretrained model's non-linear base drift with a linear one and adjusts the terminal cost to enable a closed-form adjoint and few-step ODE sampling. This change is presented as a deliberate modeling decision motivated by computational bottlenecks, not as a derivation that reduces to its own inputs by construction. No parameters are fitted to a subset and then relabeled as predictions, no self-citations are load-bearing for uniqueness or ansatz, and the final performance claims rest on standard benchmark comparisons rather than tautological equivalence. The derivation chain therefore stands on its own and remains open to external evaluation.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Adjoint Matching can be cast as a stochastic optimal control problem with a non-trivial base drift inherited from the pretrained model.
Reference graph
Works this paper leans on
- [1] M. Albergo, N. M. Boffi, and E. Vanden-Eijnden. Stochastic interpolants: A unifying framework for flows and diffusions. Journal of Machine Learning Research, 2025.
- [2]
- [3] D. Blessing, J. Berner, L. Richter, C. Domingo-Enrich, Y. Du, A. Vahdat, and G. Neumann. Trust region constrained measure transport in path space for stochastic optimal control and inference. In NeurIPS, 2025.
- [4] J. Choi, Y. Zhu, W. Guo, P. Molodyk, B. Yuan, J. Bai, Y. Xin, M. Tao, and Y. Chen. Rethinking the design space of reinforcement learning for diffusion models: On the importance of likelihood estimation beyond loss design. In ICML, 2026.
- [5]
- [6] C. Domingo-Enrich, M. Drozdzal, B. Karrer, and R. T. Q. Chen. Adjoint matching: Fine-tuning flow and diffusion generative models with memoryless stochastic optimal control. In ICLR, 2025.
- [7] B. Efron. Tweedie's formula and selection bias. Journal of the American Statistical Association, 106(496):1602–1614, 2011.
- [8]
- [9]
- [10] Y. Fan, O. Watkins, Y. Du, H. Liu, M. Ryu, C. Boutilier, P. Abbeel, M. Ghavamzadeh, K. Lee, and K. Lee. DPOK: Reinforcement learning for fine-tuning text-to-image diffusion models. In NeurIPS, 2023.
- [11] W. Guo, J. Choi, Y. Zhu, M. Tao, and Y. Chen. Proximal diffusion neural sampler. In ICML, 2026.
- [12] X. Guo, M. Cui, L. Bo, and D. Huang. ShortFT: Diffusion model alignment via shortcut-based fine-tuning. In ICCV, 2025.
- [13]
- [14] X. He, S. Fu, Y. Zhao, W. Li, J. Yang, D. Yin, F. Rao, and B. Zhang. TempFlow-GRPO: When timing matters for GRPO in flow models. In ICLR, 2026.
- [15]
- [16] J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020.
- [17] J. Ho and T. Salimans. Classifier-free diffusion guidance. arXiv:2207.12598, 2022.
- [18] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen. LoRA: Low-rank adaptation of large language models. In ICLR, 2022.
- [19] Y. Kirstain, A. Polyak, U. Singer, S. Matiana, J. Penna, and O. Levy. Pick-a-Pic: An open dataset of user preferences for text-to-image generation. In NeurIPS, 2023.
- [20]
- [21] G.-H. Liu, J. Choi, Y. Chen, B. K. Miller, and R. T. Q. Chen. Adjoint Schrödinger bridge sampler. In NeurIPS, 2025.
- [22] J. Liu, G. Liu, J. Liang, Y. Li, J. Liu, X. Wang, P. Wan, D. Zhang, and W. Ouyang. Flow-GRPO: Training flow matching models via online RL. In NeurIPS, 2025.
- [23] X. Liu, C. Gong, and Q. Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. In ICLR, 2023.
- [24] I. Loshchilov and F. Hutter. Decoupled weight decay regularization. In ICLR, 2019.
- [25] C. Lu, Y. Zhou, F. Bao, J. Chen, C. Li, and J. Zhu. DPM-Solver++: Fast solver for guided sampling of diffusion probabilistic models. Machine Intelligence Research, 2025.
- [26] M. Prabhudesai, A. Goyal, D. Pathak, and K. Fragkiadaki. Aligning text-to-image diffusion models with reward backpropagation. arXiv:2310.03739, 2024.
- [27] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
- [28] C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Denton, K. Ghasemipour, R. Gontijo Lopes, B. Karagol Ayan, T. Salimans, J. Ho, D. Fleet, and M. Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. In NeurIPS, 2022.
- [29]
- [30] Y. Shi, V. De Bortoli, A. Campbell, and A. Doucet. Diffusion Schrödinger bridge matching. In NeurIPS, 2023.
- [31] J. Shin, J. Sul, J. Lee, J. Choi, and J. Choi. Efficient generative modeling beyond memoryless diffusion via adjoint Schrödinger bridge matching. In ICML, 2026.
- [32] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole. Score-based generative modeling through stochastic differential equations. In ICLR, 2021.
- [33]
- [34] Y. Wang, Z. Li, Y. Zang, Y. Zhou, J. Bu, C. Wang, Q. Lu, C. Jin, and J. Wang. Pref-GRPO: Pairwise preference reward-based GRPO for stable text-to-image reinforcement learning. arXiv:2508.20751, 2025.
- [35] X. Wu, Y. Hao, M. Zhang, K. Sun, Z. Huang, G. Song, Y. Liu, and H. Li. Deep reward supervisions for tuning text-to-image diffusion models. In ECCV, 2024.
- [36] X. Wu, K. Sun, F. Zhu, R. Zhao, and H. Li. Human preference score: Better aligning text-to-image models with human preference. In ICCV, 2023.
- [37] J. Xu, X. Liu, Y. Wu, Y. Tong, Q. Li, M. Ding, J. Tang, and Y. Dong. ImageReward: Learning and evaluating human preferences for text-to-image generation. In NeurIPS, 2023.
- [38]
- [39] Z. Xue, J. Wu, Y. Gao, F. Kong, L. Zhu, M. Chen, Z. Liu, W. Liu, Q. Guo, W. Huang, and P. Luo. DanceGRPO: Unleashing GRPO on visual generation. arXiv:2505.07818, 2025.
- [40]
- [41] H. Zhao, H. Chen, J. Zhang, D. D. Yao, and W. Tang. Score as action: Fine-tuning diffusion generative models by continuous-time reinforcement learning. In ICML, 2025.
- [42] K. Zheng, H. Chen, H. Ye, H. Wang, Q. Zhang, K. Jiang, H. Su, S. Ermon, J. Zhu, and M.-Y. Liu. DiffusionNFT: Online diffusion reinforcement with forward process. In ICLR, 2026.
Excerpts from the paper's appendix
- Appendix A (Impact Statement): "This work develops computational methods for reward-based fine-tuning of diffusion models. Our study is theoretical and computational in nature and us…"
- Derivation of the drift family: Substituting this $I_t$ into Eq. (41) yields the announced family
  $$D(t) = \frac{2Ct^2 - 1}{t\,(2Ct^2 - 2t + 1)}, \qquad C > \tfrac{1}{2}, \qquad (44)$$
  which is Eq. (15). Step 5 (the second constraint is automatic): using the explicit forms of $I_t$, $\Phi_t = (2C-1)t/(2Ct^2 - 2t + 1)$, and $J_t = (I_1 - I_t)\,\Phi_t^2$, a direct computation (using $\bar\alpha_t = \Phi_0 J_t/(\Phi_t I_1)$ and $\gamma_t^2 = 2 I_t J_t / I_1$) verifies that $\bar\alpha_t^2 + \gamma_t^2 = \dots$ The second constraint in Eq. (39) therefore imposes no additional restriction on $D(t)$. Step 6 (terminal distribution): $\Phi_0$, the value of $\Phi_t$ at $t = 0$, evaluates to $0$, since the numerator $(2C-1)t$ vanishes there; hence $\bar\alpha_1 = 0$, and $X_1 = \bar\beta_1 X_0 + \gamma_1 \varepsilon$ marginalizes (using $X_0 \sim \mathcal{N}(0, I)$ and $\bar\beta_1 = 1$) to $p^{\mathrm{base}}_1 = \mathcal{N}\big(0, (2C-1)I\big)$ (45), since $I_1 = \dots$ The family is therefore unique up to the single scalar $C$.
- Exact adjoint calculation: plugging the linear base drift in Eq. (15) into Eq. (10) yields
  $$a(t; X_t) = \frac{(2C-1)\,t}{2Ct^2 - 2t + 1}\; a(1; X_1), \qquad a(1; X_1) = \nabla g(X_1). \qquad (46)$$
- B.4 (Proof of Proposition 3.2): "We use the standard SOC reduction for reward-tilted sampling [6, 13]: under the SOC problem (4)–(5) with terminal co…"
- On classifier-free guidance: CFG is known to improve alignment between the generated image and the text prompt, but this comes at the cost of doubled NFEs required to compute the unconditional prediction. As shown in Tab. 4, applying CFG improves all metrics except Aesthetics.
- Appendix D (Additional Qualitative Examples): figure panels comparing SD3.5, SD3.5 + CFG, AM, AM + Mu…
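A quick numerical check (ours, not the paper's) of the recovered coefficient: $\Phi_0 = 0$ and $\Phi_1 = 1$ for every $C > 1/2$, so the closed-form adjoint interpolates from zero at the start of the trajectory to the exact terminal gradient $\nabla g(X_1)$.

```python
# Sanity check on the adjoint coefficient recovered from Eq. (46):
# Phi(t) = (2C-1) t / (2C t^2 - 2t + 1), with Phi(0) = 0 and Phi(1) = 1.
# The denominator has no real roots for C > 1/2, so Phi is finite on [0, 1].
import numpy as np

def Phi(t, C):
    return (2 * C - 1) * t / (2 * C * t**2 - 2 * t + 1)

for C in (0.75, 1.0, 5.0):  # any C > 1/2 from the stated family
    assert abs(Phi(0.0, C)) < 1e-12 and abs(Phi(1.0, C) - 1.0) < 1e-12
    print(f"C={C}:", np.round(Phi(np.linspace(0, 1, 6), C), 3))
```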