pith. machine review for the scientific record.

arxiv: 2604.24351 · v1 · submitted 2026-04-27 · 💻 cs.LG · cs.AI · cs.CV · cs.SE

Recognition: unknown

Diffusion Templates: A Unified Plugin Framework for Controllable Diffusion

Hong Zhang, Yingda Chen, Zhongjie Duan

Pith reviewed 2026-05-08 04:20 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CV · cs.SE
keywords diffusion models · controllable generation · plugin framework · modularity · composability · image editing · model zoo · capability injection

The pith

Diffusion Templates decouple control capabilities from specific diffusion models using a shared plugin interface for composable generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Diffusion Templates to address the fragmentation that arises because controllable diffusion methods are typically built as isolated, backbone-specific systems with incompatible pipelines and formats. It organizes the solution around three components: Template models that convert task-specific inputs into an intermediate capability representation, a Template cache that provides a standardized interface for injecting those capabilities, and a Template pipeline that loads, merges, and applies one or more caches into the base diffusion runtime. Because the interface sits at the systems level, it accommodates different carriers such as KV-Cache and LoRA under one abstraction. The authors demonstrate this by constructing a model zoo for tasks including structural control, image editing, super-resolution, inpainting, and aesthetic alignment. A sympathetic reader would care because the design promises that new controls and new backbones can be added without rewriting core inference code, and without losing the ability to combine multiple controls in one run.
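
The three-component design can be sketched as a minimal plugin interface. Everything below is hypothetical illustration, not the paper's API: names like `TemplateCache` and `TemplatePipeline`, the string carrier tags, and the toy closures standing in for real KV-Cache or LoRA application are all assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class TemplateCache:
    """Standardized container for one capability. `carrier` names how the
    capability is stored; `payload` holds carrier-specific data."""
    carrier: str   # e.g. "kv_cache" or "lora"
    payload: dict

class TemplatePipeline:
    """Loads and applies Template caches to a base sampler. Carrier-specific
    logic lives in registered injectors, so the pipeline itself never
    branches on cache type."""
    injectors: dict[str, Callable] = {}

    @classmethod
    def register_injector(cls, carrier: str):
        def wrap(fn):
            cls.injectors[carrier] = fn
            return fn
        return wrap

    def __init__(self, base_sample: Callable[[str], str]):
        self.base_sample = base_sample
        self.caches: list[TemplateCache] = []

    def load(self, cache: TemplateCache) -> None:
        self.caches.append(cache)

    def generate(self, prompt: str) -> str:
        # Wrap the base sampler once per loaded cache, via the injector
        # registered for that cache's carrier.
        sample = self.base_sample
        for cache in self.caches:
            sample = self.injectors[cache.carrier](sample, cache)
        return sample(prompt)

# Two toy injectors standing in for real KV-Cache / LoRA application.
@TemplatePipeline.register_injector("kv_cache")
def inject_kv(sample, cache):
    return lambda p: sample(p) + f"+kv[{cache.payload['task']}]"

@TemplatePipeline.register_injector("lora")
def inject_lora(sample, cache):
    return lambda p: sample(p) + f"+lora[{cache.payload['task']}]"
```

The point of the sketch is that a new carrier would only need a new registered injector, leaving `generate` untouched.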

Core claim

The central claim is that defining control injection at the systems level through Template models, a Template cache, and a Template pipeline unifies a broad range of controllable diffusion tasks while preserving modularity and extensibility. The Template cache serves as the key standardized interface that accepts heterogeneous capability carriers, allowing the same pipeline to support structural, color, editing, and other controls without tying the abstraction to any single control architecture. This enables a single runtime to load and compose multiple Template caches across evolving diffusion backbones.
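
The "loads and composes" step can be illustrated with a toy merge rule. This is purely hypothetical, since the paper's merge logic is not specified here: the sketch assumes same-carrier caches collapse into one before injection, with LoRA-style weight deltas summing and KV-style runtime contexts concatenating.

```python
def merge_caches(caches):
    """Collapse a list of {'carrier': str, 'payload': ...} dicts so that at
    most one cache per carrier reaches the injection step. Toy rules:
    LoRA-style deltas add; KV-style context lists concatenate."""
    merged = {}
    for cache in caches:
        carrier, payload = cache["carrier"], cache["payload"]
        if carrier not in merged:
            merged[carrier] = payload
        elif carrier == "lora":
            merged[carrier] = merged[carrier] + payload  # sum weight deltas
        elif carrier == "kv_cache":
            merged[carrier] = merged[carrier] + payload  # concatenate contexts
        else:
            raise ValueError(f"no merge rule for carrier {carrier!r}")
    return [{"carrier": c, "payload": p} for c, p in merged.items()]
```

Under this rule set, loading two LoRA controls and two KV controls yields exactly two merged caches, one per carrier.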

What carries the argument

The Template cache, functioning as a standardized, architecture-independent interface for injecting capabilities from Template models into the base diffusion runtime.

If this is right

  • Multiple controls such as inpainting and aesthetic alignment can be composed within a single generation without custom integration code.
  • Capabilities developed for one diffusion backbone can be transferred to others by swapping the base model while keeping the same Template caches.
  • New tasks can be added by implementing only a Template model and cache, reusing the existing pipeline and training infrastructure.
  • The framework supports ongoing evolution of diffusion backbones without forcing reimplementation of existing controls.
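
The backbone-transfer bullet can be sketched concretely. The sketch below is hypothetical (its names and the string-concatenation stand-in for injection are assumptions, not the paper's code): the same cache objects ride unchanged on two different base samplers.

```python
class TemplateCache:
    """Carrier-agnostic capability container (toy version for this sketch)."""
    def __init__(self, carrier: str, payload: dict):
        self.carrier, self.payload = carrier, payload

def make_pipeline(base_sample, caches):
    """Attach the same caches to any backbone exposing `base_sample(prompt)`."""
    def generate(prompt: str) -> str:
        out = base_sample(prompt)
        for c in caches:
            # Stand-in for carrier-specific injection into the backbone.
            out += f"|{c.carrier}:{c.payload['task']}"
        return out
    return generate

# The same two caches transfer across backbones with no per-backbone code.
caches = [TemplateCache("lora", {"task": "inpaint"}),
          TemplateCache("kv_cache", {"task": "aesthetic"})]
gen_a = make_pipeline(lambda p: f"backboneA({p})", caches)
gen_b = make_pipeline(lambda p: f"backboneB({p})", caches)
```

Whether real KV tensors and LoRA deltas transfer this cleanly across architecturally different backbones is exactly what the claim asserts and the sketch cannot settle.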

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Community contributions could accumulate as a shared library of Templates, reducing duplicated effort across research groups.
  • A common runtime might make direct comparisons between control methods more straightforward by isolating differences to the Template layer.
  • The same abstraction could be tested for compatibility with non-diffusion generative models if the cache interface is kept general.

Load-bearing premise

That an interface defined at the systems level can support all relevant control mechanisms without needing changes specific to each control architecture.

What would settle it

A new control technique that cannot be expressed as a Template model plus cache without modifying the base diffusion model's inference code or runtime hooks.

Figures

Figures reproduced from arXiv: 2604.24351 by Hong Zhang, Yingda Chen, Zhongjie Duan.

Figure 1: Overview of the Diffusion Templates framework.
Figure 2: Structural control results with a shared depth input. Prompt 1: “A cat is sitting on a …”
Figure 3: Brightness adjustment results with a shared prompt: “A cat is sitting on a stone.”
Figure 4: Color adjustment results with a shared prompt: “A cat is sitting on a stone.”
Figure 5: Image editing results with a shared reference image. Prompt 1: “Put a hat on this cat.”
Figure 6: Super-resolution results with a shared prompt: “A cat is sitting on a stone.”
Figure 7: Sharpness control results with a shared prompt: “A cat is sitting on a stone.”
Figure 8: Aesthetic alignment results with a shared prompt: “A cat is sitting on a stone.”
Figure 9: Content-reference results with a shared prompt: “A cat is sitting on a stone.”
Figure 10: Local inpainting results. Prompt 1: “An orange cat is sitting on a stone.” Prompt 2: …
Figure 11: Age control results with a shared prompt: “A portrait of a woman with black hair, …”
Figure 12: Fusion of super-resolution and sharpness enhancement capabilities, producing images …
Figure 13: Fusion of structural control, image editing, and color adjustment, enabling the generation …
Figure 14: Fusion of structural control, sharpness enhancement, and aesthetic alignment, yielding …
Figure 15: Fusion of local inpainting, image editing, and brightness adjustment, enabling localized …
Original abstract

Controllable diffusion methods have substantially expanded the practical utility of diffusion models, but they are typically developed as isolated, backbone-specific systems with incompatible training pipelines, parameter formats, and runtime hooks. This fragmentation makes it difficult to reuse infrastructure across tasks, transfer capabilities across backbones, or compose multiple controls within a single generation pipeline. We present Diffusion Templates, a unified and open plugin framework that decouples base-model inference from controllable capability injection. The framework is organized around three components: Template models that map arbitrary task-specific inputs to an intermediate capability representation, a Template cache that functions as a standardized interface for capability injection, and a Template pipeline that loads, merges, and injects one or more Template caches into the base diffusion runtime. Because the interface is defined at the systems level rather than tied to a specific control architecture, heterogeneous capability carriers such as KV-Cache and LoRA can be supported under the same abstraction. Based on this design, we build a diverse model zoo spanning structural control, brightness adjustment, color adjustment, image editing, super-resolution, sharpness enhancement, aesthetic alignment, content reference, local inpainting, and age control. These case studies show that Diffusion Templates can unify a broad range of controllable generation tasks while preserving modularity, composability, and practical extensibility across rapidly evolving diffusion backbones. All resources will be open sourced, including code, models, and datasets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes Diffusion Templates, a unified plugin framework for controllable diffusion that decouples base-model inference from capability injection via three components: Template models (mapping task inputs to intermediate representations), a Template cache (standardized injection interface), and a Template pipeline (loading, merging, and injecting caches into the runtime). It claims this systems-level abstraction supports heterogeneous carriers (e.g., KV-Cache, LoRA) without architecture-specific ties, enabling unification, modularity, composability, and extensibility. The authors present a model zoo of ten case studies (structural control, brightness/color adjustment, editing, super-resolution, sharpness, aesthetic alignment, content reference, inpainting, age control) and commit to open-sourcing code, models, and datasets.

Significance. If the abstraction truly permits uniform handling of disparate carriers and backbones while preserving quality and efficiency, the work could meaningfully reduce fragmentation in controllable diffusion by enabling infrastructure reuse, cross-backbone transfer, and multi-control composition. The explicit open-sourcing of all resources is a concrete strength that would support reproducibility and community extensions.

major comments (2)
  1. [Abstract] Abstract (design description of Template cache and pipeline): the claim that 'the interface is defined at the systems level rather than tied to a specific control architecture' allowing KV-Cache and LoRA under the same abstraction is load-bearing for the unification claim, yet the manuscript provides no concrete specification, pseudocode, or merge logic showing how distinct operations (runtime tensor modification for KV-Cache vs. weight-matrix updates for LoRA) are dispatched uniformly without carrier-specific handlers inside the standardized interface.
  2. [Abstract] Abstract (case studies paragraph): the assertion that the framework 'can unify a broad range of controllable generation tasks while preserving modularity, composability, and practical extensibility' lacks any quantitative support; no performance metrics, quality comparisons, efficiency measurements, or ablation results are reported for the ten tasks, leaving the central claim that unification occurs 'without loss of quality or efficiency' unsubstantiated.
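
The referee's first objection can be made concrete with a toy sketch (all names hypothetical, not from the paper): even behind a single `apply_cache` entry point, a KV-style payload must edit runtime activations while a LoRA-style payload must edit weights, so the "uniform" interface necessarily contains carrier-specific handlers somewhere.

```python
class ToyAttention:
    """Toy stand-in for an attention block: one weight (a scalar here, for
    brevity) plus a list of injected runtime context."""
    def __init__(self, w=2.0):
        self.w = w            # "weights"
        self.extra_kv = []    # injected runtime context

    def forward(self, x):
        ctx = self.w * x
        for kv in self.extra_kv:   # KV-style: modify activations at runtime
            ctx += kv
        return ctx

def apply_cache(layer, cache):
    """One uniform entry point, but the dispatch inside is carrier-specific:
    KV-style payloads touch activations, LoRA-style payloads touch weights."""
    if cache["carrier"] == "kv_cache":
        layer.extra_kv.append(cache["payload"])
    elif cache["carrier"] == "lora":
        layer.w += cache["payload"]   # stand-in for folding a low-rank delta
    else:
        raise ValueError(f"no handler for carrier {cache['carrier']!r}")
```

The question the referee raises is where these branches live and whether adding a new carrier really leaves the standardized interface untouched.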

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for acknowledging the potential of Diffusion Templates to address fragmentation in controllable diffusion. We address each major comment below with clarifications and commit to revisions that strengthen the manuscript without overstating the current content.

Point-by-point responses
  1. Referee: [Abstract] Abstract (design description of Template cache and pipeline): the claim that 'the interface is defined at the systems level rather than tied to a specific control architecture' allowing KV-Cache and LoRA under the same abstraction is load-bearing for the unification claim, yet the manuscript provides no concrete specification, pseudocode, or merge logic showing how distinct operations (runtime tensor modification for KV-Cache vs. weight-matrix updates for LoRA) are dispatched uniformly without carrier-specific handlers inside the standardized interface.

    Authors: We agree that the abstract does not contain concrete specification, pseudocode, or explicit merge logic, which leaves the systems-level unification claim insufficiently supported in that section. The framework design intends for the Template cache to act as a carrier-agnostic container and for the pipeline to perform generic loading and merging, with carrier-specific dispatch handled via extensible plugins. To resolve this, we will revise the abstract to include a concise description of the interface and add pseudocode plus a merge-logic diagram to the main text in the revised manuscript. This will explicitly illustrate uniform handling at the systems level. revision: yes

  2. Referee: [Abstract] Abstract (case studies paragraph): the assertion that the framework 'can unify a broad range of controllable generation tasks while preserving modularity, composability, and practical extensibility' lacks any quantitative support; no performance metrics, quality comparisons, efficiency measurements, or ablation results are reported for the ten tasks, leaving the central claim that unification occurs 'without loss of quality or efficiency' unsubstantiated.

    Authors: We acknowledge that the abstract reports no quantitative metrics, comparisons, or ablations, so the claim of unification without loss of quality or efficiency is not numerically substantiated there. The ten case studies currently serve to demonstrate breadth and feasibility through qualitative examples across tasks and backbones. In the revision we will add quantitative evaluations, including quality metrics (e.g., FID, CLIP scores), efficiency measurements (e.g., runtime overhead), and limited baseline comparisons plus composition ablations for a subset of tasks. These additions will be supported by the open-sourced code, models, and datasets. revision: yes
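
The promised runtime-overhead numbers could start from a harness like this toy sketch (names hypothetical; a real study would wrap an actual diffusion sampler and report overhead per injected cache):

```python
import time

def measure_overhead(sample_fn, inject_fns, prompt, repeats=50):
    """Toy harness: time a bare sampler vs. the same sampler wrapped by a
    stack of injectors; return mean per-call overhead in seconds."""
    def timed(fn):
        t0 = time.perf_counter()
        for _ in range(repeats):
            fn(prompt)
        return (time.perf_counter() - t0) / repeats

    wrapped = sample_fn
    for inj in inject_fns:          # stack the injectors, innermost first
        wrapped = inj(wrapped)
    return timed(wrapped) - timed(sample_fn)
```

With a real sampler, this would be run once per task in the model zoo and once per fused combination, so composition cost is measured and not assumed.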

Circularity Check

0 steps flagged

No circularity: high-level systems framework with no derivations or self-referential reductions

Full rationale

The paper presents a descriptive architectural framework consisting of Template models, a Template cache interface, and a Template pipeline for unifying controllable diffusion tasks. No mathematical equations, derivations, parameter fittings, or predictive claims appear in the abstract or described content. The central claim—that a systems-level interface enables support for heterogeneous carriers such as KV-Cache and LoRA—is advanced as a design property rather than derived from prior results or self-defined quantities. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing steps. The work is self-contained as an engineering abstraction without reduction to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

The central claim rests on newly introduced conceptual entities and the domain assumption that a systems-level interface can unify heterogeneous controls.

axioms (1)
  • domain assumption An interface defined at the systems level can support heterogeneous capability carriers such as KV-Cache and LoRA under the same abstraction.
    Invoked when describing how the Template cache and pipeline enable unification across control types.
invented entities (3)
  • Template model no independent evidence
    purpose: Maps arbitrary task-specific inputs to an intermediate capability representation
    New component introduced to standardize inputs for the framework.
  • Template cache no independent evidence
    purpose: Functions as a standardized interface for capability injection
    Core abstraction for decoupling base inference from controls.
  • Template pipeline no independent evidence
    purpose: Loads, merges, and injects one or more Template caches into the base diffusion runtime
    Handles integration and supports composability.

pith-pipeline@v0.9.0 · 5552 in / 1396 out tokens · 112205 ms · 2026-05-08T04:20:35.980172+00:00 · methodology


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. BRIDGE: Background Routing and Isolated Discrete Gating for Coarse-Mask Local Editing

    cs.CV 2026-05 unverdicted novelty 7.0

    BRIDGE uses separate main and subject paths plus a discrete gate on positional embeddings to improve local edits with coarse masks, raising local SigLIP2-T from 0.39 to 0.50 on its benchmark.

  2. BRIDGE: Background Routing and Isolated Discrete Gating for Coarse-Mask Local Editing

    cs.CV 2026-05 unverdicted novelty 6.0

    BRIDGE improves coarse-mask local image editing in DiT models by routing background and subject paths separately and using a discrete geometric gate on positional embeddings to reduce mask-shape bias.

Reference graph

Works this paper leans on

50 extracted references · 19 canonical work pages · cited by 1 Pith paper · 12 internal anchors

  1. [1]

    Apiserve: Efficient api support for large-language model inferencing

    Reyna Abhyankar, Zijian He, Vikranth Srivatsa, Hao Zhang, and Yiying Zhang. Infercept: Efficient intercept support for augmented large language model inference. arXiv preprint arXiv:2402.01869, 2024

  2. [2]

    Model context protocol specification

    Anthropic. Model context protocol specification. Technical specification, 2024. https://modelcontextprotocol.io/

  3. [3]

    Introducing agent skills

    Anthropic. Introducing agent skills. Product announcement, 2025. https://www.anthropic.com/news/skills, accessed April 12, 2026

  4. [4]

    Flux.1 model family

    Black Forest Labs. Flux.1 model family. Technical report/model release, 2024. https://blackforestlabs.ai/

  5. [5]

    A computational approach to edge detection. IEEE Transactions on pattern analysis and machine intelligence, 8(6):679–698, 1986

    John Canny. A computational approach to edge detection. IEEE Transactions on pattern analysis and machine intelligence, 8(6):679–698, 1986

  6. [6]

    Attrictrl: Fine-grained control of aesthetic attribute intensity in diffusion models. arXiv preprint arXiv:2508.02151, 2025

    Die Chen, Zhongjie Duan, Zhiwen Li, Cen Chen, Daoyuan Chen, Yaliang Li, and Yingda Chen. Attrictrl: Fine-grained control of aesthetic attribute intensity in diffusion models. arXiv preprint arXiv:2508.02151, 2025

  7. [7]

    PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

    Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, et al. Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:2310.00426, 2023

  8. [8]

    FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

    Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691, 2023

  9. [9]

    Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in neural information processing systems, 35:16344–16359, 2022

    Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in neural information processing systems, 35:16344–16359, 2022

  10. [10]

    Artaug: Enhancing text-to-image generation through synthesis-understanding interaction. arXiv preprint arXiv:2412.12888, 2024

    Zhongjie Duan, Qianyi Zhao, Cen Chen, Daoyuan Chen, Wenmeng Zhou, Yaliang Li, and Yingda Chen. Artaug: Enhancing text-to-image generation through synthesis-understanding interaction. arXiv preprint arXiv:2412.12888, 2024

  11. [11]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first international conference on machine learning, 2024

  12. [12]

    An image is worth one word: Personalizing text-to-image generation using textual inversion

    Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit Haim Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. In The Eleventh International Conference on Learning Representations, 2023

  13. [13]

    LTX-2: Efficient Joint Audio-Visual Foundation Model

    Yoav HaCohen, Benny Brazowski, Nisan Chiprut, Yaki Bitterman, Andrew Kvochko, Avishai Berkowitz, Daniel Shalem, Daphna Lifschitz, Dudu Moshe, Eitan Porat, et al. Ltx-2: Efficient joint audio-visual foundation model. arXiv preprint arXiv:2601.03233, 2026

  14. [14]

    Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020

  15. [15]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022

  16. [16]

    Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022

  17. [17]

    Genai arena: An open evaluation platform for generative models. Advances in Neural Information Processing Systems, 37:79889–79908, 2024

    Dongfu Jiang, Max Ku, Tianle Li, Yuansheng Ni, Shizhuo Sun, Rongqi Fan, and Wenhu Chen. Genai arena: An open evaluation platform for generative models. Advances in Neural Information Processing Systems, 37:79889–79908, 2024

  18. [18]

    Auto-Encoding Variational Bayes

    Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013

  19. [19]

    Pick-a-pic: An open dataset of user preferences for text-to-image generation. Advances in neural information processing systems, 36:36652–36663, 2023

    Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-pic: An open dataset of user preferences for text-to-image generation. Advances in neural information processing systems, 36:36652–36663, 2023

  20. [20]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024

  21. [21]

    Efficient memory management for large language model serving with pagedattention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th symposium on operating systems principles, pages 611–626, 2023

  22. [22]

    Snapkv: Llm knows what you are looking for before generation. Advances in Neural Information Processing Systems, 37:22947–22970, 2024

    Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. Snapkv: Llm knows what you are looking for before generation. Advances in Neural Information Processing Systems, 37:22947–22970, 2024

  23. [23]

    Hunyuan-dit: A powerful multi-resolution diffusion transformer with fine-grained chinese understanding. arXiv preprint arXiv:2405.08748, 2024

    Zhimin Li, Jianwei Zhang, Qin Lin, Jiangfeng Xiong, Yanxin Long, Xinchi Deng, Yingfang Zhang, Xingchao Liu, Minbin Huang, Zedong Xiao, et al. Hunyuan-dit: A powerful multi-resolution diffusion transformer with fine-grained chinese understanding. arXiv preprint arXiv:2405.08748, 2024

  24. [24]

    T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models

    Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, and Ying Shan. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. In Proceedings of the AAAI conference on artificial intelligence, volume 38, pages 4296–4304, 2024

  25. [25]

    Function calling and tool use in openai models

    OpenAI. Function calling and tool use in openai models. Technical documentation, 2023. https://platform.openai.com/docs/guides/function-calling

  26. [26]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

  27. [27]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023

  28. [28]

    Mooncake: A kvcache-centric disaggregated architecture for llm serving. ACM Transactions on Storage, 2024

    Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Heyi Tang, Feng Ren, Teng Ma, Shangming Cai, Yineng Zhang, Mingxing Zhang, et al. Mooncake: A kvcache-centric disaggregated architecture for llm serving. ACM Transactions on Storage, 2024

  29. [29]

    Tool learning with foundation models. ACM Computing Surveys, 57(4):1–40, 2024

    Yujia Qin, Shengding Hu, Yankai Lin, Weize Chen, Ning Ding, Ganqu Cui, Zheni Zeng, Xuanhe Zhou, Yufei Huang, Chaojun Xiao, et al. Tool learning with foundation models. ACM Computing Surveys, 57(4):1–40, 2024

  30. [30]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021

  31. [31]

    Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67, 2020

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67, 2020

  32. [32]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

  33. [33]

    U-net: Convolutional networks for biomedical image segmentation

    Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015

  34. [34]

    Dex: Deep expectation of apparent age from a single image

    Rasmus Rothe, Radu Timofte, and Luc Van Gool. Dex: Deep expectation of apparent age from a single image. In Proceedings of the IEEE international conference on computer vision workshops, pages 10–15, 2015

  35. [35]

    Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation

    Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22500–22510, 2023

  36. [36]

    Toolformer: Language models can teach themselves to use tools. Advances in neural information processing systems, 36:68539–68551, 2023

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in neural information processing systems, 36:68539–68551, 2023

  37. [37]

    Denoising Diffusion Implicit Models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020

  38. [38]

    Preble: Efficient distributed prompt scheduling for llm serving. arXiv preprint arXiv:2407.00023, 2024

    Vikranth Srivatsa, Zijian He, Reyna Abhyankar, Dongming Li, and Yiying Zhang. Preble: Efficient distributed prompt scheduling for llm serving. arXiv preprint arXiv:2407.00023, 2024

  39. [39]

    SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025

  40. [40]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025

  41. [41]

    Real-esrgan: Training real-world blind super-resolution with pure synthetic data

    Xintao Wang, Liangbin Xie, Chao Dong, and Ying Shan. Real-esrgan: Training real-world blind super-resolution with pure synthetic data. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1905–1914, 2021

  42. [42]

    Qwen-Image Technical Report

    Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report. arXiv preprint arXiv:2508.02324, 2025

  43. [43]

    The rise and potential of large language model based agents: A survey. Science China Information Sciences, 68(2):121101, 2025

    Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, et al. The rise and potential of large language model based agents: A survey. Science China Information Sciences, 68(2):121101, 2025

  44. [44]

    Sana: Efficient high-resolution image synthesis with linear diffusion transformers. arXiv preprint arXiv:2410.10629, 2024

    Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, et al. Sana: Efficient high-resolution image synthesis with linear diffusion transformers. arXiv preprint arXiv:2410.10629, 2024

  45. [45]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  46. [46]

    React: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In The eleventh international conference on learning representations, 2022

  47. [47]

    IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

    Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721, 2023

  48. [48]

    Eligen: Entity-level controlled image generation with regional attention

    Hong Zhang, Zhongjie Duan, Xingjun Wang, Yingda Chen, and Yu Zhang. Eligen: Entity-level controlled image generation with regional attention. In Proceedings of the 7th ACM International Conference on Multimedia in Asia, pages 1–7, 2025

  49. [49]

    Adding conditional control to text-to-image diffusion models

    Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF international conference on computer vision, pages 3836–3847, 2023

  50. [50]

    H2o: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems, 36:34661–34710, 2023

    Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, et al. H2o: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems, 36:34661–34710, 2023