pith. machine review for the scientific record.

arxiv: 2603.18093 · v2 · submitted 2026-03-18 · 💻 cs.CV

Recognition: no theorem link

One-to-More: High-Fidelity Training-Free Anomaly Generation with Attention Control

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 09:42 UTC · model grok-4.3

classification 💻 cs.CV
keywords anomaly generation · diffusion models · self-attention grafting · few-shot synthesis · industrial anomaly detection · training-free method · attention control

The pith

O2MAG generates realistic text-guided anomalies from one reference image by grafting self-attention across parallel diffusion processes without any training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces O2MAG to solve the shortage of anomalous images in industrial anomaly detection. It synthesizes new anomalies by running three parallel diffusion processes and transferring self-attention features from a single real anomalous reference while using the anomaly mask to prevent foreground-background confusion. Additional steps align the output with text prompts and strengthen attention inside the masked regions so the results match real anomaly patterns more closely than trained alternatives. If the method works, detection models can be trained on larger and more faithful anomalous datasets without the time cost of learning new generators. Experiments show the generated data improves downstream detection accuracy over prior state-of-the-art synthesis approaches.

Core claim

O2MAG manipulates three parallel diffusion processes via self-attention grafting from one reference anomalous image, incorporates the anomaly mask to avoid query confusion, applies Anomaly-Guided Optimization to close the gap between text prompts and true anomaly semantics, and uses Dual-Attention Enhancement to reinforce self- and cross-attention on masked areas, thereby producing synthetic anomalies that adhere to real anomalous distributions.
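The grafting-plus-mask step can be made concrete with a toy sketch. This is an editorial illustration, not the paper's implementation: the linear blending rule, the hard same-region attention restriction, and the `alpha` grafting strength are assumptions standing in for whatever O2MAG does inside its U-Net attention layers.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def grafted_self_attention(q_tgt, k_tgt, v_tgt, k_ref, v_ref, mask, alpha=0.8):
    """Masked self-attention grafting (illustrative sketch).

    q_tgt, k_tgt, v_tgt : (N, d) query/key/value features of the target branch,
                          N = H*W flattened spatial positions.
    k_ref, v_ref        : (N, d) key/value features of the reference branch.
    mask                : (N,) binary anomaly mask, 1 = anomalous foreground.
    alpha               : hypothetical grafting strength (a free parameter).
    """
    d = q_tgt.shape[-1]
    # Blend reference keys/values into the target stream only inside the mask,
    # so background positions keep their original appearance.
    m = mask[:, None]
    k = m * (alpha * k_ref + (1 - alpha) * k_tgt) + (1 - m) * k_tgt
    v = m * (alpha * v_ref + (1 - alpha) * v_tgt) + (1 - m) * v_tgt
    logits = q_tgt @ k.T / np.sqrt(d)
    # Foreground/background separation: foreground queries are barred from
    # attending to background keys, and vice versa -- one plausible reading
    # of the paper's "query confusion" mitigation.
    same_region = mask[:, None] == mask[None, :]
    logits = np.where(same_region, logits, -1e9)
    return softmax(logits, axis=-1) @ v
```

Because the mask gates both the key/value blend and the attention pattern, background positions are provably untouched by the reference features in this sketch; only the masked region inherits the reference anomaly's attention structure.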

What carries the argument

Self-attention grafting from a single reference anomaly across three parallel diffusion processes, combined with mask handling, Anomaly-Guided Optimization, and Dual-Attention Enhancement.

If this is right

  • The generated anomalies follow real distributions more closely than those from prior few-shot methods.
  • Downstream anomaly detection models trained on the synthetic data achieve higher performance on real test images.
  • Text prompts can guide the type of anomaly produced without requiring retraining.
  • No training step is needed, so new reference anomalies can be used immediately for synthesis.
  • Dual-attention reinforcement inside masks reduces faint or incomplete anomaly generation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same grafting approach could be tested for synthesizing rare events in video or 3D data where only one example exists.
  • If attention transfer remains stable across domains, the technique might apply to other few-shot image editing tasks beyond industrial defects.
  • The results suggest diffusion-model attention layers carry reusable structural information that can be repurposed across images without fine-tuning.
  • Evaluating the method on medical or satellite imagery would show whether one reference suffices when anomaly appearance varies widely.

Load-bearing premise

That grafting self-attention from one reference anomaly and applying the listed optimizations produces outputs whose distribution matches real anomalies without any model training.

What would settle it

If anomaly detectors trained on O2MAG-generated data show no accuracy gain over detectors trained on data from existing trained synthesis methods when evaluated on standard real-world industrial test sets.

Figures

Figures reproduced from arXiv: 2603.18093 by Caifeng Shan, Chenyang Si, Fang Zhao, Haoxiang Rao, Yan Lyu, Yuanyi Duan, Zhao Wang.

Figure 1. Left: comparison of diffusion-based anomaly generation. Training-based approaches either (i) add a defect block to learn the anomaly distribution or (ii) train embeddings by textual inversion to mimic anomalous visual styles; (iii) the existing training-free method, AnomalyAny, fails to express precise and realistic anomaly semantics, while (iv) our proposed training-free O2MAG delivers background-faithf…
Figure 2. Overview of the proposed O2MAG. Our method synthesizes anomalies by coordinating self-attention in three parallel diffusion processes.
Figure 3. (a) The intermediate reconstruction during the iterative…
Figure 4. Anomaly-Guided Optimization (AGO) pipeline. Num…
Figure 5. Qualitative comparison of generated results on MVTec…
Figure 6. Qualitative comparison of generated hazelnut-hole. SeaS is designed to use an unbalanced abnormal prompt P = “a <ob> with <df1>, <df2>, …, <dfN>” (N=4), where <ob> captures the normal [cls] appearance and <dfn> encodes the [anomaly type] features. For cross-category transfer, SeaS is retrained with wood-hole anomaly images and hazelnut normal images so that the anomaly token <df…
Figure 8. The intermediate reconstruction during the iterative de…
Figure 9. Analysis of the optimization step in AGO.
Figure 11. Anomaly generation under zero-shot settings. We aim to transfer anomalous features of the same anomaly type from “Reference…
Figure 12. Qualitative ablation of TriAG, AGO, and DAE for anomaly realism and localization.
Figure 13. Qualitative comparison of generated results on VisA. The sub-image in the lower right corner is the anomaly mask.
Figure 14. Limitations of self-attention for tiny anomaly generation. The middle columns show the three leading principal components…
Figure 15. Generalization to real-world scenes.
Figure 16. Logical anomaly generation. Suboptimal on small anomaly generation: as discussed in Sec. 3.2, self-attention is applied to intermediate feature maps at 64×64, 32×32, 16×16, and 8×8, and the reference anomaly image is accordingly downsampled to the corresponding spatial resolution. When the anomalous region is small, its representation in the self-attention space tends to exhibit…
Figure 17. bottle qualitative results on MVTec-AD.
Figure 18. cable qualitative results on MVTec-AD.
Figure 19. capsule qualitative results on MVTec-AD.
Figure 20. carpet qualitative results on MVTec-AD.
Figure 21. AnomalyDiffusion fails to synthesize the intended defects within the anomaly mask; DualAnoDiff and SeaS do not preserve…
Figure 22. hazelnut qualitative results on MVTec-AD.
Figure 23. leather qualitative results on MVTec-AD.
Figure 24. metal nut qualitative results on MVTec-AD.
Figure 25. pill qualitative results on MVTec-AD.
Figure 26. screw qualitative results on MVTec-AD.
Figure 27. tile qualitative results on MVTec-AD.
Figure 28. toothbrush qualitative results on MVTec-AD.
Figure 29. transistor qualitative results on MVTec-AD.
Figure 30. wood qualitative results on MVTec-AD.
Figure 31. zipper qualitative results on MVTec-AD.
read the original abstract

Industrial anomaly detection (AD) is characterized by an abundance of normal images but a scarcity of anomalous ones. Although numerous few-shot anomaly synthesis methods have been proposed to augment anomalous data for downstream AD tasks, most existing approaches require time-consuming training and struggle to learn distributions that are faithful to real anomalies, thereby restricting the efficacy of AD models trained on such data. To address these limitations, we propose a training-free few-shot anomaly generation method, namely O2MAG, which leverages the self-attention in One reference anomalous image to synthesize More realistic anomalies, supporting effective downstream anomaly detection. Specifically, O2MAG manipulates three parallel diffusion processes via self-attention grafting and incorporates the anomaly mask to mitigate foreground-background query confusion, synthesizing text-guided anomalies that closely adhere to real anomalous distributions. To bridge the semantic gap between the encoded anomaly text prompts and the true anomaly semantics, Anomaly-Guided Optimization is further introduced to align the synthesis process with the target anomalous distribution, steering the generation toward realistic and text-consistent anomalies. Moreover, to mitigate faint anomaly synthesis inside anomaly masks, Dual-Attention Enhancement is adopted during generation to reinforce both self- and cross-attention on masked regions. Extensive experiments validate the effectiveness of O2MAG, demonstrating its superior performance over prior state-of-the-art methods on downstream AD tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes O2MAG, a training-free few-shot anomaly generation method that grafts self-attention maps from a single reference anomalous image into three parallel diffusion processes, incorporates anomaly masks to reduce foreground-background confusion, introduces Anomaly-Guided Optimization to align text prompts with anomaly semantics, and applies Dual-Attention Enhancement to reinforce attention on masked regions, with the goal of synthesizing text-guided anomalies that closely match real anomalous distributions and improve downstream industrial anomaly detection performance.

Significance. If the distributional fidelity and downstream superiority claims hold with rigorous evidence, the work would represent a meaningful contribution to anomaly synthesis by removing the training requirement common to prior few-shot methods while leveraging diffusion attention control for realism, potentially enabling more effective data augmentation in data-scarce industrial settings.

major comments (3)
  1. [Method] Method section (core grafting procedure): Grafting self-attention from only a single reference anomaly fixes the spatial and semantic attention patterns to that specific example; no component (mask handling, optimization, or dual enhancement) is shown to expand support to intra-class variability such as differing defect shapes, sizes, or textures, directly undermining the claim that outputs 'closely adhere to real anomalous distributions'.
  2. [Abstract and Experiments] Abstract and Experiments: The abstract asserts 'extensive experiments' and 'superior performance over prior state-of-the-art methods on downstream AD tasks', yet provides no quantitative metrics (AUROC/AUPR deltas, FID/MMD scores, or ablation tables); this absence leaves the central fidelity and superiority claims without load-bearing evidence.
  3. [Abstract and Method] Abstract and Method (Anomaly-Guided Optimization): The method is positioned as training-free, but the optimization step count, learning rate, and grafting strength are listed as free parameters; their presence requires per-instance tuning and partially contradicts the training-free framing while introducing potential sensitivity not addressed in the claims.
minor comments (2)
  1. [Method] Clarify the exact formulation of the anomaly mask integration into the attention computation with an equation reference to avoid ambiguity in foreground-background query handling.
  2. [Experiments] Ensure all result figures include side-by-side comparisons with baselines and quantitative captions rather than relying solely on qualitative visuals.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below with clarifications and indicate the corresponding revisions to the manuscript.

read point-by-point responses
  1. Referee: [Method] Method section (core grafting procedure): Grafting self-attention from only a single reference anomaly fixes the spatial and semantic attention patterns to that specific example; no component (mask handling, optimization, or dual enhancement) is shown to expand support to intra-class variability such as differing defect shapes, sizes, or textures, directly undermining the claim that outputs 'closely adhere to real anomalous distributions'.

    Authors: The single-reference grafting provides the base attention pattern, yet the stochastic denoising in diffusion, combined with text conditioning, anomaly masks, Anomaly-Guided Optimization, and Dual-Attention Enhancement, permits controlled variations in defect appearance within the masked region. We agree that explicit evidence of intra-class variability was insufficient and have added new qualitative examples and a quantitative variability analysis (using shape and texture metrics) in the revised experiments section. revision: yes

  2. Referee: [Abstract and Experiments] Abstract and Experiments: The abstract asserts 'extensive experiments' and 'superior performance over prior state-of-the-art methods on downstream AD tasks', yet provides no quantitative metrics (AUROC/AUPR deltas, FID/MMD scores, or ablation tables); this absence leaves the central fidelity and superiority claims without load-bearing evidence.

    Authors: The full manuscript already contains the requested quantitative results (AUROC/AUPR gains, FID scores, and ablation tables) in Section 4. To make these claims immediately visible, we have revised the abstract to include key numerical deltas and added explicit cross-references to the tables and figures. revision: yes

  3. Referee: [Abstract and Method] Abstract and Method (Anomaly-Guided Optimization): The method is positioned as training-free, but the optimization step count, learning rate, and grafting strength are listed as free parameters; their presence requires per-instance tuning and partially contradicts the training-free framing while introducing potential sensitivity not addressed in the claims.

    Authors: Training-free denotes the lack of any model training or fine-tuning; the listed values are fixed hyperparameters used uniformly across all instances and datasets. We have revised the abstract and method section to state the exact fixed values (e.g., 50 steps, lr = 0.01) and added a short robustness analysis showing performance stability under small perturbations of these values. revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation consists of explicit algorithmic operations on diffusion attention

full rationale

The paper describes O2MAG as a training-free procedure that grafts self-attention maps from a single reference anomaly into three parallel diffusion processes, augments them with an anomaly mask to reduce query confusion, applies Anomaly-Guided Optimization for text alignment, and uses Dual-Attention Enhancement on masked regions. These steps are presented as direct manipulations of attention and diffusion trajectories without any parameter fitting whose output is then relabeled as a prediction, without self-definitional equations, and without load-bearing self-citations that substitute for independent justification. The central claim—that the resulting samples adhere to real anomalous distributions—is an empirical assertion tested on downstream AD tasks rather than a quantity that reduces to the inputs by algebraic identity or construction. No equations or algorithmic reductions in the provided description equate the generated distribution to the reference by tautology.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 2 invented entities

The approach rests on the assumption that diffusion-model attention maps can be directly transferred to control anomaly semantics and that the introduced optimizations align generation without introducing distribution shift.

free parameters (2)
  • attention grafting strength
    The degree to which self-attention from the reference image is blended into the target generation process is a tunable choice not derived from first principles.
  • optimization step count and learning rate
    Anomaly-Guided Optimization requires unspecified iteration count and step size to steer toward the target distribution.
axioms (2)
  • domain assumption Self-attention maps in diffusion models encode semantic anomaly features that can be grafted across images while preserving background fidelity.
    Invoked when describing self-attention grafting and mask incorporation.
  • domain assumption Text prompts for anomalies can be aligned to visual distributions via gradient-based optimization inside the diffusion sampling loop.
    Basis for Anomaly-Guided Optimization.
invented entities (2)
  • Anomaly-Guided Optimization no independent evidence
    purpose: Align encoded text prompts with true anomaly semantics during synthesis.
    New optimization procedure introduced to close the semantic gap.
  • Dual-Attention Enhancement no independent evidence
    purpose: Reinforce self- and cross-attention on masked anomalous regions to avoid faint synthesis.
    New attention reinforcement step during generation.
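The ledger's two free parameters can be seen concretely in a minimal sketch of gradient-based guidance inside a sampling loop. This is a stand-in, not the paper's AGO: the cosine-distance loss, the finite-difference gradient, and the `encode` feature map are all hypothetical placeholders for whatever differentiable objective the real method optimizes.

```python
import numpy as np

def cosine_dist(a, b):
    # 1 - cosine similarity; small epsilon guards against zero norms
    return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def anomaly_guided_opt(z, target_feat, encode, n_steps=50, lr=0.01):
    """Gradient-descent guidance on a latent z (illustrative sketch).

    n_steps and lr are exactly the ledger's free parameters; `encode` is a
    placeholder feature map. The gradient is taken by finite differences
    purely to keep the sketch dependency-free; a real diffusion pipeline
    would use autodiff through the denoiser.
    """
    z = z.astype(float).copy()
    eps = 1e-4
    for _ in range(n_steps):
        base = cosine_dist(encode(z), target_feat)
        grad = np.zeros_like(z)
        for i in range(z.size):
            zp = z.copy()
            zp.flat[i] += eps
            grad.flat[i] = (cosine_dist(encode(zp), target_feat) - base) / eps
        z -= lr * grad  # steer the latent toward the target anomaly semantics
    return z
```

The sensitivity the referee flags lives in `n_steps` and `lr`: too few steps under-align the prompt with the anomaly semantics, while an aggressive learning rate can push the latent off the data manifold.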

pith-pipeline@v0.9.0 · 5551 in / 1319 out tokens · 28920 ms · 2026-05-15T09:42:21.858387+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 6 internal anchors

  1. Musawar Ali, Nicola Fioraio, Samuele Salti, and Luigi Di Stefano. AnomalyControl: few-shot anomaly generation by ControlNet inpainting. IEEE Access, 2024.
  2. Paul Bergmann, Michael Fauser, David Sattlegger, and Carsten Steger. MVTec AD: a comprehensive real-world dataset for unsupervised anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9592–9600, 2019.
  3. Mikołaj Bińkowski, Danica J. Sutherland, Michael Arbel, and Arthur Gretton. Demystifying MMD GANs. arXiv preprint arXiv:1801.01401, 2018.
  4. Mingdeng Cao, Xintao Wang, Zhongang Qi, Ying Shan, Xiaohu Qie, and Yinqiang Zheng. MasaCtrl: tuning-free mutual self-attention control for consistent image synthesis and editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22560–22570, 2023.
  5. Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and Daniel Cohen-Or. Attend-and-Excite: attention-based semantic guidance for text-to-image diffusion models. ACM Transactions on Graphics (TOG), 42(4):1–10, 2023.
  6. JaeHyuck Choi, MinJun Kim, and JeHyeong Hong. MAGIC: mask-guided diffusion inpainting with multi-level perturbations and context-aware alignment for few-shot anomaly generation. arXiv preprint arXiv:2507.02314, 2025.
  7. Songmin Dai, Yifan Wu, Xiaoqiang Li, and Xiangyang Xue. Generating and reweighting dense contrastive patterns for unsupervised anomaly detection. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1454–1462, 2024.
  8. Zhewei Dai, Shilei Zeng, Haotian Liu, Xurui Li, Feng Xue, and Yu Zhou. SeaS: few-shot industrial anomaly image generation with separation and sharing fine-tuning, 2025.
  9. Yuxuan Duan, Yan Hong, Li Niu, and Liqing Zhang. Few-shot defect image generation via defect-aware feature manipulation. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 571–578, 2023.
  10. Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H. Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022.
  11. Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Communications of the ACM, 63(11):139–144, 2020.
  12. Guan Gui, Bin-Bin Gao, Jun Liu, Chengjie Wang, and Yunsheng Wu. Few-shot anomaly-driven generation for anomaly classification and segmentation, 2025.
  13. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  14. Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-Prompt image editing with cross-attention control. arXiv preprint arXiv:2208.01626, 2022.
  15. Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.
  16. Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
  17. Jie Hu, Yawen Huang, Yilin Lu, Guoyang Xie, Guannan Jiang, Yefeng Zheng, and Zhichao Lu. AnomalyXFusion: multi-modal anomaly synthesis with diffusion. arXiv preprint arXiv:2404.19444, 2024.
  18. Teng Hu, Jiangning Zhang, Ran Yi, Yuzhen Du, Xu Chen, Liang Liu, Yabiao Wang, and Chengjie Wang. AnomalyDiffusion: few-shot anomaly image generation with diffusion model. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 8526–8534, 2024.
  19. Haoqi Huang, Ping Wang, Jianhua Pei, Jiacheng Wang, Shahen Alexanian, and Dusit Niyato. Deep learning advancements in anomaly detection: a comprehensive survey. IEEE Internet of Things Journal, 2025.
  20. Ying Jin, Jinlong Peng, Qingdong He, Teng Hu, Jiafu Wu, Hao Chen, Haoxuan Wang, Wenbing Zhu, Mingmin Chi, Jun Liu, et al. Dual-interrelated diffusion model for few-shot anomaly image generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 30420–30429, 2025.
  21. Diederik P. Kingma and Jimmy Ba. Adam: a method for stochastic optimization, 2017.
  22. Chun-Liang Li, Kihyuk Sohn, Jinsung Yoon, and Tomas Pfister. CutPaste: self-supervised learning for anomaly detection and localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9664–9674, 2021.
  23. Xiaofan Li, Zhizhong Zhang, Xin Tan, Chengwei Chen, Yanyun Qu, Yuan Xie, and Lizhuang Ma. PromptAD: learning prompts with only normal samples for few-shot anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16838–16848, 2024.
  24. Jiang Lin and Yaping Yan. A comprehensive augmentation framework for anomaly detection. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 8742–8749, 2024.
  25. Nan Liu, Shuang Li, Yilun Du, Antonio Torralba, and Joshua B. Tenenbaum. Compositional visual generation with composable diffusion models. In European Conference on Computer Vision, pages 423–439. Springer, 2022.
  26. Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. Which training methods for GANs do actually converge? In International Conference on Machine Learning, pages 3481–…
  27. Shuanlong Niu, Bin Li, Xinggang Wang, and Hui Lin. Defect image sample generation with GAN for improving defect recognition. IEEE Transactions on Automation Science and Engineering, 17(3):1611–1622, 2020.
  28. Utkarsh Ojha, Yijun Li, Jingwan Lu, Alexei A. Efros, Yong Jae Lee, Eli Shechtman, and Richard Zhang. Few-shot image generation via cross-domain correspondence. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10743–10752, 2021.
  29. Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
  30. Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
  31. Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
  32. Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. DreamBooth: fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22500–22510, 2023.
  33. Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. LAION-5B: an open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
  34. Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
  35. Jaewoo Song, Daemin Park, Kanghyun Baek, Sangyub Lee, Jooyoung Choi, Eunji Kim, and Sungroh Yoon. DefectFill: realistic defect generation with inpainting diffusion model for visual inspection. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 18718–18727.
  36. Han Sun, Yunkang Cao, Hao Dong, and Olga Fink. Unseen visual anomaly generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 25508–25517, 2025.
  37. Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. Plug-and-play diffusion features for text-driven image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1921–1930, 2023.
  38. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
  39. Han Yang, Chuanguang Yang, Qiuli Wang, Zhulin An, Weilun Feng, Libo Huang, and Yongjun Xu. Multi-party collaborative attention control for image customization. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 7942–7951, 2025.
  40. Xincheng Yao, Ruoqi Li, Jing Zhang, Jun Sun, and Chongyang Zhang. Explicit boundary guided semi-push-pull contrastive learning for supervised anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24490–24499, 2023.
  41. Qianzi Yu, Kai Zhu, Yang Cao, Feijie Xia, and Yu Kang. TF2: few-shot text-free training-free defect image generation for industrial anomaly inspection. IEEE Transactions on Circuits and Systems for Video Technology, 34(11):11825–11837, 2024.
  42. Vitjan Zavrtanik, Matej Kristan, and Danijel Skočaj. DRAEM: a discriminatively trained reconstruction embedding for surface anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8330–8339, 2021.
  43. Gongjie Zhang, Kaiwen Cui, Tzu-Yi Hung, and Shijian Lu. Defect-GAN: high-fidelity defect synthesis for automated defect inspection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2524–2534, 2021.
  44. Ximiao Zhang, Min Xu, and Xiuzhuang Zhou. RealNet: a feature selection network with realistic synthetic anomaly for anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16699–16708, 2024.