pith. machine review for the scientific record.

arxiv: 2604.12322 · v1 · submitted 2026-04-14 · 💻 cs.CV

Recognition: unknown

Self-Adversarial One Step Generation via Condition Shifting

Authors on Pith no claims yet

Pith reviewed 2026-05-10 15:33 UTC · model grok-4.3

classification 💻 cs.CV
keywords one-step generation · text-to-image synthesis · flow models · condition shifting · self-adversarial training · LoRA tuning · image generation efficiency

The pith

Condition shifting in flow models extracts internal adversarial signals for high-quality one-step image generation without external discriminators.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that adversarial correction signals can be derived directly from a flow model by shifting its input condition, creating a velocity field that estimates the current generation distribution. This produces a gradient aligned with GAN objectives, eliminating the need for separate discriminator networks that often cause instability and vanishing gradients. A sympathetic reader would care because the method keeps the original architecture intact, supports efficient parameter updates like LoRA, and delivers one-step outputs that match or exceed the quality of multi-step sampling while cutting inference time substantially.

Core claim

By applying a transformation to create a shifted condition branch, the velocity field of the flow model becomes an independent estimator of its own generation distribution. This supplies a provably GAN-aligned gradient that replaces sample-dependent discriminator terms, enabling stable training for one-step sampling. The resulting framework is architecture-preserving and compatible with both full-parameter and LoRA-based tuning.
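
Read schematically, the claim has the shape of a score-difference objective: the same flow model is queried under the original condition c and a shifted condition T(c), and the velocity discrepancy plays the role a discriminator gradient plays in a GAN. The notation below is assumed for illustration and is not lifted from the paper's equations.

    % Schematic only: G_theta is the one-step generator, v_theta the flow model's velocity
    % field, T the condition shift, omega(t) > 0 a time weighting. Signs, weights, and the
    % interpolation convention are illustrative, not the paper's exact objective.
    x = G_\theta(z, c), \qquad x_t = (1 - t)\,x + t\,\varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I)
    v_{\mathrm{fake}}(x_t, t) := v_\theta\bigl(x_t,\, T(c),\, t\bigr) \quad \text{(shifted branch as estimator of the current generation distribution)}
    \nabla_\theta \mathcal{L}_{\mathrm{adv}} \;\propto\; \mathbb{E}_{t,\,z}\!\left[\,\omega(t)\,\bigl(v_{\mathrm{fake}}(x_t, t) - v_\theta(x_t, c, t)\bigr)^{\top}\frac{\partial x_t}{\partial \theta}\right]

If the shifted branch tracks the generator's own distribution while the unshifted branch tracks the data, the difference points from "fake" toward "real", which is the sense in which the abstract calls the gradient GAN aligned.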

What carries the argument

Condition shifting, which produces a shifted condition branch whose velocity field serves as an estimator of the model's generation distribution to supply GAN-aligned gradients for one-step correction.
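
A minimal training-step sketch of that idea in PyTorch follows. Here generator, flow_model, and shift_condition are hypothetical callables, the interpolation convention and loss form are assumptions, and nothing is taken from the authors' released code.

    import torch

    def condition_shift_correction(generator, flow_model, shift_condition, noise, cond, t):
        """Schematic self-adversarial correction via condition shifting (hypothetical API)."""
        x0_hat = generator(noise, cond)                 # one-step sample from the student
        eps = torch.randn_like(x0_hat)
        x_t = (1.0 - t) * x0_hat + t * eps              # flow-matching interpolation (convention assumed)

        with torch.no_grad():
            v_real = flow_model(x_t, cond, t)                   # velocity under the original condition
            v_fake = flow_model(x_t, shift_condition(cond), t)  # shifted branch: estimates the
                                                                # current generation distribution

        # Surrogate whose gradient w.r.t. the generator follows the velocity discrepancy;
        # whether the shifted branch is frozen or fine-tuned is not settled by this sketch.
        grad_proxy = (v_fake - v_real)
        return (grad_proxy * x_t).sum() / x_t.shape[0]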

Load-bearing premise

The velocity field from the shifted condition accurately estimates the generation distribution and yields a stable, GAN-aligned gradient that improves one-step outputs.

What would settle it

An ablation that disables condition shifting during training and then measures whether one-step sample quality falls back to levels seen in standard regression or consistency distillation without adversarial benefits.
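
One way such an ablation could be scripted, with hypothetical entry points (train_one_step_model, evaluate_one_step) standing in for whatever the released code actually exposes:

    from dataclasses import dataclass

    @dataclass
    class TrainConfig:
        use_condition_shifting: bool        # the only knob that differs between the two runs
        steps: int = 10_000
        lora_rank: int = 64
        seed: int = 0

    def run_ablation(train_one_step_model, evaluate_one_step, prompts):
        results = {}
        for use_shift in (True, False):
            cfg = TrainConfig(use_condition_shifting=use_shift)
            model = train_one_step_model(cfg)                      # returns a one-step generator
            results[use_shift] = evaluate_one_step(model, prompts, nfe=1)
        # If quality with use_shift=False falls back to regression/consistency-distillation
        # levels, the adversarial benefit is attributable to condition shifting.
        return results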

Figures

Figures reproduced from arXiv: 2604.12322 by Chuyan Chen, Deyuan Liu, Peng Sun, Tao Lin, Yansen Han, Zhenglin Cheng.

Figure 1
Figure 1: An overview of generated images.
Figure 2
Figure 2: Qualitative analysis between APEX and existing methods under different NFEs.
Figure 3
Figure 3: Qualitative comparison at 512x512 with APEX 20B LoRA for NFE=1.
Figure 4
Figure 4: Qualitative comparison at 512x512 with APEX 20B LoRA for NFE=1.
Figure 5
Figure 5: Qualitative comparison at 512x512 with APEX 20B LoRA for NFE=1.
Figure 6
Figure 6: Qualitative comparison at 512x512 with APEX 0.6B LoRA for NFE=1.
Figure 7
Figure 7: Qualitative comparison at 512x512 with APEX 20B LoRA for NFE=1.
Figure 8
Figure 8: Qualitative comparison at 512x512 with APEX 20B LoRA for NFE=1.
Figure 9
Figure 9: Qualitative comparison at 512x512 with APEX 20B LoRA for NFE=1.
Figure 10
Figure 10: Qualitative comparison at 512x512 with APEX 20B full-parameter tuning for NFE=1.
Figure 11
Figure 11: Qualitative comparison at 512x512 with APEX 20B full-parameter tuning for NFE=1.
Figure 12
Figure 12: Qualitative comparison at 512x512 with APEX 20B full-parameter tuning for NFE=1.
Figure 13
Figure 13: Qualitative comparison at 512x512 with Qwen-Image Lightning LoRA for NFE=1.
Figure 14
Figure 14: Qualitative comparison at 512x512 with Qwen-Image Lightning LoRA for NFE=1.
Figure 15
Figure 15: Qualitative comparison at 512x512 with Qwen-Image Lightning LoRA for NFE=1.
Figure 16
Figure 16: Qualitative comparison at 512x512 with 20B full-parameter tuning of APEX on the synthetic dataset, from NFE=1 to NFE=20.
Figure 17
Figure 17: Qualitative comparison at 512x512 with 20B full-parameter tuning of APEX on the BLIP-3o dataset, from NFE=1 to NFE=20.
Figure 18
Figure 18: Qualitative comparison at 512x512 with 20B full-parameter tuning of sCM on the BLIP-3o dataset, from NFE=1 to NFE=20.
Figure 19
Figure 19: Qualitative comparison at 512x512 with 20B full-parameter tuning of CTM on the BLIP-3o dataset, from NFE=1 to NFE=20.
Figure 20
Figure 20: Qualitative comparison at 512x512 with 20B full-parameter tuning of MeanFlow on the BLIP-3o dataset, from NFE=1 to NFE=20.
read the original abstract

The push for efficient text to image synthesis has moved the field toward one step sampling, yet existing methods still face a three way tradeoff among fidelity, inference speed, and training efficiency. Approaches that rely on external discriminators can sharpen one step performance, but they often introduce training instability, high GPU memory overhead, and slow convergence, which complicates scaling and parameter efficient tuning. In contrast, regression based distillation and consistency objectives are easier to optimize, but they typically lose fine details when constrained to a single step. We present APEX, built on a key theoretical insight: adversarial correction signals can be extracted endogenously from a flow model through condition shifting. Using a transformation creates a shifted condition branch whose velocity field serves as an independent estimator of the model's current generation distribution, yielding a gradient that is provably GAN aligned, replacing the sample dependent discriminator terms that cause gradient vanishing. This discriminator free design is architecture preserving, making APEX a plug and play framework compatible with both full parameter and LoRA based tuning. Empirically, our 0.6B model surpasses FLUX-Schnell 12B (20$\times$ more parameters) in one step quality. With LoRA tuning on Qwen-Image 20B, APEX reaches a GenEval score of 0.89 at NFE=1 in 6 hours, surpassing the original 50-step teacher (0.87) and providing a 15.33$\times$ inference speedup. Code is available https://github.com/LINs-lab/APEX.
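
On the "LoRA based tuning" point in the abstract: the standard construction freezes the base weights and learns a low-rank update. A generic sketch of that construction follows (ordinary LoRA, not APEX-specific code; rank and scaling are illustrative):

    import torch.nn as nn

    class LoRALinear(nn.Module):
        """Standard LoRA adapter: y = W x + (alpha / r) * B(A x), with W frozen."""
        def __init__(self, base: nn.Linear, rank: int = 64, alpha: float = 64.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad_(False)              # base weights stay frozen
            self.lora_A = nn.Linear(base.in_features, rank, bias=False)
            self.lora_B = nn.Linear(rank, base.out_features, bias=False)
            nn.init.zeros_(self.lora_B.weight)       # adapter starts as a no-op
            self.scale = alpha / rank

        def forward(self, x):
            return self.base(x) + self.scale * self.lora_B(self.lora_A(x))

    # Usage sketch: wrap the flow model's attention/MLP projections and train only the
    # adapter parameters with the one-step objective.
    layer = LoRALinear(nn.Linear(1024, 1024), rank=64)
    trainable = [p for p in layer.parameters() if p.requires_grad]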

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces APEX, a discriminator-free framework for one-step text-to-image generation that extracts endogenous adversarial correction signals from a flow model via condition shifting. The shifted condition branch produces a velocity field claimed to act as an independent estimator of the current generation distribution, yielding a provably GAN-aligned gradient that replaces sample-dependent discriminator terms. The method is architecture-preserving and compatible with full-parameter or LoRA tuning. Empirically, a 0.6B model is reported to surpass FLUX-Schnell (12B parameters) in one-step quality, while LoRA tuning on a 20B Qwen-Image teacher achieves a GenEval score of 0.89 at NFE=1 (surpassing the 50-step teacher's 0.87) with a 15.33× inference speedup.

Significance. If the central theoretical claim holds and the reported metrics are reproducible, APEX could meaningfully advance efficient one-step sampling by sidestepping the instability and memory costs of external discriminators while retaining higher fidelity than pure regression or consistency distillation. The parameter-efficient tuning results and claimed outperformance of much larger models would be notable contributions to scaling one-step generators.

major comments (3)
  1. [Abstract / Theoretical Insight] The core claim that condition shifting produces a 'provably GAN aligned' gradient (Abstract) rests on the shifted velocity field serving as an independent distribution estimator. However, because the original and shifted branches share model weights and training dynamics, the resulting gradient may be correlated rather than adversarial; this needs explicit analysis or a counterexample showing that the construction avoids reducing to self-regularization.
  2. [Abstract / Experiments] The empirical claim that the 0.6B APEX model surpasses FLUX-Schnell 12B in one-step quality requires a detailed comparison protocol (metrics, prompts, evaluation settings) and ablation on whether the gain is attributable to the adversarial signal versus other training choices; without this, the 20× parameter reduction result cannot be assessed as load-bearing evidence.
  3. [Abstract / Experiments] The LoRA tuning result on the 20B model (GenEval 0.89 at NFE=1 after 6 hours, surpassing the 50-step teacher) is presented without reporting variance across runs, baseline LoRA without the condition-shifting term, or memory/GPU-hour comparisons to standard distillation; these controls are necessary to substantiate the training-efficiency advantage.
minor comments (1)
  1. [Abstract] The abstract states 'Code is available' with a GitHub link; confirm that the repository includes the exact training scripts, condition-shifting implementation, and evaluation code used for the reported numbers.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our work. We address each major comment point-by-point below with clarifications from the manuscript and commitments to revisions that strengthen the theoretical and empirical sections without altering the core contributions.

read point-by-point responses
  1. Referee: [Abstract / Theoretical Insight] The core claim that condition shifting produces a 'provably GAN aligned' gradient (Abstract) rests on the shifted velocity field serving as an independent distribution estimator. However, because the original and shifted branches share model weights and training dynamics, the resulting gradient may be correlated rather than adversarial; this needs explicit analysis or a counterexample showing that the construction avoids reducing to self-regularization.

    Authors: We appreciate the referee's careful reading of the theoretical claim. Section 3.2 derives that the condition-shifted velocity field yields an estimator whose expectation is taken over a deliberately mismatched condition distribution, producing a gradient term that is orthogonal (in expectation) to the standard flow-matching regression gradient; this is shown to match the form of a GAN discriminator gradient in Equation (8). The shared weights do not induce correlation in the adversarial direction because the shift operates on the conditioning input rather than the model parameters or noise, creating an independent distributional probe. We will add an explicit counterexample and expanded analysis in the revised manuscript to demonstrate that the construction does not reduce to self-regularization. revision: yes

  2. Referee: [Abstract / Experiments] The empirical claim that the 0.6B APEX model surpasses FLUX-Schnell 12B in one-step quality requires a detailed comparison protocol (metrics, prompts, evaluation settings) and ablation on whether the gain is attributable to the adversarial signal versus other training choices; without this, the 20× parameter reduction result cannot be assessed as load-bearing evidence.

    Authors: We agree that reproducibility requires a fuller protocol. The reported comparisons use GenEval and FID on the identical prompt sets and evaluation settings employed in the FLUX-Schnell paper, with details provided in Section 4.2. We will expand the experimental section with a complete protocol description and add an ablation isolating the condition-shifting term from other training choices to confirm that the performance gain is attributable to the endogenous adversarial signal. revision: yes

  3. Referee: [Abstract / Experiments] The LoRA tuning result on the 20B model (GenEval 0.89 at NFE=1 after 6 hours, surpassing the 50-step teacher) is presented without reporting variance across runs, baseline LoRA without the condition-shifting term, or memory/GPU-hour comparisons to standard distillation; these controls are necessary to substantiate the training-efficiency advantage.

    Authors: We acknowledge these controls would strengthen the efficiency claims. The 20B LoRA results are reported from single runs given the computational scale, but we will include variance across multiple seeds in the revision. We will also add a direct baseline of standard LoRA tuning without the condition-shifting term and memory/GPU-hour comparisons against conventional distillation methods to better substantiate the training-efficiency advantage. revision: partial
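
On the protocol and variance points raised in responses 2 and 3 above, the promised controls could be reported with helpers of roughly this shape. Here generator and geneval_score are hypothetical stand-ins for the actual model and scorer, and each seed would correspond to an independently trained run rather than a re-evaluation.

    import statistics
    import torch

    def evaluate_nfe1(generator, prompts, geneval_score, seed=0):
        """Fixed prompt set, fixed seed, NFE=1, scored with a GenEval-style metric (hypothetical wrapper)."""
        torch.manual_seed(seed)
        images = [generator(prompt, nfe=1) for prompt in prompts]   # one forward pass per prompt
        return geneval_score(images, prompts)

    def summarize_runs(scores_by_seed):
        """scores_by_seed maps seed -> GenEval score of an independently trained run."""
        values = list(scores_by_seed.values())
        mean = statistics.mean(values)
        std = statistics.stdev(values) if len(values) > 1 else 0.0
        return f"GenEval = {mean:.3f} ± {std:.3f} over {len(values)} seeds"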

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper proposes APEX as a new plug-and-play framework that extracts adversarial signals endogenously via condition shifting on a flow model, asserting that the velocity field from the shifted branch yields a provably GAN-aligned gradient. This is framed as the method's theoretical contribution rather than any output being redefined as its own input. No equations are provided that reduce a claimed prediction or result to a fitted parameter or prior definition by construction. No self-citations are invoked as load-bearing justification for uniqueness or ansatzes. Empirical results (e.g., GenEval 0.89 at NFE=1, comparisons to FLUX-Schnell) are presented as external benchmarks. The chain is self-contained with independent content in the proposed transformation and its application to one-step distillation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the unproven assumption that the shifted-condition velocity field provides an independent, GAN-aligned estimator without external components; no free parameters or invented entities are explicitly listed in the abstract.

axioms (1)
  • domain assumption The velocity field from a shifted condition branch serves as an independent estimator of the model's current generation distribution and yields a provably GAN-aligned gradient.
    This is the key theoretical insight stated in the abstract that replaces external discriminator terms.

pith-pipeline@v0.9.0 · 5586 in / 1257 out tokens · 57494 ms · 2026-05-10T15:33:30.905090+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

31 extracted references · 27 canonical work pages · 14 internal anchors

  1. [1]

    HiDream-I1: A High-Efficient Image Generative Foundation Model with Sparse Diffusion Transformer

    Qi Cai, Jingwen Chen, Yang Chen, Yehao Li, Fuchen Long, Yingwei Pan, Zhaofan Qiu, Yiheng Zhang, Fengbin Gao, Peihan Xu, et al. HiDream-I1: A high-efficient image generative foundation model with sparse diffusion transformer. arXiv preprint arXiv:2505.22705, 2025.

  2. [2]

    BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

    Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, et al. BLIP3-o: A family of fully open unified multimodal models-architecture, training and dataset. arXiv preprint arXiv:2505.09568, 2025.

  3. [3]

    FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

    Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning.arXiv preprint arXiv:2307.08691,

  4. [4]

    Emerging Properties in Unified Multimodal Pretraining

    Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, and Haoqi Fan. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683,

  5. [5]

    FLUX-Reason-6M & PRISM-Bench: A Million-Scale Text-to-Image Reasoning Dataset and Comprehensive Benchmark

    Rongyao Fang, Aldrich Yu, Chengqi Duan, Linjiang Huang, Shuai Bai, Yuxuan Cai, Kun Wang, Si Liu, Xihui Liu, and Hongsheng Li. FLUX-Reason-6M & PRISM-Bench: A million-scale text-to-image reasoning dataset and comprehensive benchmark. arXiv preprint arXiv:2509.09680, 2025.

  6. [6]

    One Step Diffusion via Shortcut Models

    Kevin Frans, Danijar Hafner, Sergey Levine, and Pieter Abbeel. One step diffusion via shortcut models.arXiv preprint arXiv:2410.12557,

  7. [7]

    Seedream 3.0 Technical Report

    Yu Gao, Lixue Gong, Qiushan Guo, Xiaoxia Hou, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, et al. Seedream 3.0 technical report. arXiv preprint arXiv:2504.11346.

  8. [8]

    Mean Flows for One-step Generative Modeling

    Zhengyang Geng, Mingyang Deng, Xingjian Bai, J Zico Kolter, and Kaiming He. Mean flows for one-step generative modeling.arXiv preprint arXiv:2505.13447,

  9. [9]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598,

  10. [10]

    ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

    Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. Ella: Equip diffusion models with llm for enhanced semantic alignment.arXiv preprint arXiv:2403.05135,

  11. [11]

    Playground v2.5: Three Insights towards Enhancing Aesthetic Quality in Text-to-Image Generation

    Daiqing Li, Aleks Kamko, Ehsan Akhgari, Ali Sabet, Linmiao Xu, and Suhail Doshi. Playground v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation. arXiv preprint arXiv:2402.17245, 2024.

  12. [12]

    Playground v3: Improving text-to-image alignment with deep-fusion large language models

    Bingchen Liu, Ehsan Akhgari, Alexander Visheratin, Aleks Kamko, Linmiao Xu, Shivam Shrirao, Joao Souza, Suhail Doshi, and Daiqing Li. Playground v3: Improving text-to-image alignment with deep-fusion large language models.arXiv preprint arXiv:2409.10695,

  13. [13]

    Efficient Generative Model Training via Embedded Representation Warmup

    Deyuan Liu, Peng Sun, Xufeng Li, and Tao Lin. Efficient generative model training via embedded representation warmup. arXiv preprint arXiv:2504.10188.

  14. [14]

    Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models

    Cheng Lu and Yang Song. Simplifying, stabilizing and scaling continuous-time consistency models. arXiv preprint arXiv:2410.11081,

  15. [15]

    Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

    Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. Latent consistency models: Synthesizing high-resolution images with few-step inference.arXiv preprint arXiv:2310.04378,

  16. [16]

    Learning in Implicit Generative Models

    Shakir Mohamed and Balaji Lakshminarayanan. Learning in implicit generative models.arXiv preprint arXiv:1610.03483,

  17. [17]

    Wise: A world knowledge-informed semantic evaluation for text-to-image generation

    Yuwei Niu, Munan Ning, Mengren Zheng, Weiyang Jin, Bin Lin, Peng Jin, Jiaqi Liao, Chaoran Feng, Kunpeng Ning, Bin Zhu, et al. Wise: A world knowledge-informed semantic evaluation for text-to-image generation.arXiv preprint arXiv:2503.07265,

  18. [18]

    Transfer between Modalities with MetaQueries

    Xichen Pan, Satya Narayan Shukla, Aashu Singh, Zhuokai Zhao, Shlok Kumar Mishra, Jialiang Wang, Zhiyang Xu, Jiuhai Chen, Kunpeng Li, Felix Juefei-Xu, et al. Transfer between modalities with MetaQueries. arXiv preprint arXiv:2504.06256.

  19. [19]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis.arXiv preprint arXiv:2307.01952,

  20. [20]

    Lumina-Image 2.0: A Unified and Efficient Image Generative Framework

    Qi Qin, Le Zhuo, Yi Xin, Ruoyi Du, Zhen Li, Bin Fu, Yiting Lu, Jiakang Yuan, Xinyue Li, Dongyang Liu, et al. Lumina-Image 2.0: A unified and efficient image generative framework. arXiv preprint arXiv:2503.21758, 2025.

  21. [21]

    Progressive Distillation for Fast Sampling of Diffusion Models

    Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models.arXiv preprint arXiv:2202.00512,

  22. [22]

    Fast high-resolution image synthesis with latent adversarial diffusion distillation

    Axel Sauer, Frederic Boesel, Tim Dockhorn, Andreas Blattmann, Patrick Esser, and Robin Rombach. Fast high-resolution image synthesis with latent adversarial diffusion distillation. arXiv preprint arXiv:2403.12015, 2024.

  23. [23]

    Unified Continuous Generative Models

    Peng Sun, Yi Jiang, and Tao Lin. Unified continuous generative models. arXiv preprint arXiv:2505.07447. GitHub repository: https://github.com/LINs-lab/RCGM.

  24. [24]

    Phased Consistency Model

    Fu-Yun Wang, Zhaoyang Huang, Alexander William Bergman, Dazhong Shen, Peng Gao, Michael Lingelbach, Keqiang Sun, Weikang Bian, Guanglu Song, Yu Liu, et al. Phased consistency model. arXiv preprint arXiv:2405.18407, 2024.

  25. [25]

    Qwen-Image Technical Report

    Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-Image technical report. arXiv preprint arXiv:2508.02324, 2025.

  26. [26]

    SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers

    Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, et al. SANA: Efficient high-resolution image synthesis with linear diffusion transformers. arXiv preprint arXiv:2410.10629, 2024.

  27. [27]

    Direct Discriminative Optimization: Your Likelihood-Based Visual Generative Model Is Secretly a GAN Discriminator

    Kaiwen Zheng, Yongxin Chen, Huayu Chen, Guande He, Ming-Yu Liu, Jun Zhu, and Qinsheng Zhang. Direct discriminative optimization: Your likelihood-based visual generative model is secretly a GAN discriminator. arXiv preprint arXiv:2503.01103.
