pith. machine review for the scientific record.

arxiv: 2604.13863 · v1 · submitted 2026-04-15 · 💻 cs.CV

Recognition: unknown

PostureObjectstitch: Anomaly Image Generation Considering Assembly Relationships in Industrial Scenarios

Dongpu Cao, Gang Chen, Hao Chen, Hongchang Chen, Jieming Zhang, Ying Li, Yujie Lei, Yushi Liu, Zebei Tong, Zhi Zheng

Pith reviewed 2026-05-10 14:13 UTC · model grok-4.3

classification 💻 cs.CV
keywords anomaly image generation · industrial assembly · diffusion models · geometric prior · condition decoupling · synthetic data · DreamAssembly dataset · anomaly detection

The pith

A diffusion model generates industrial anomaly images that respect component assembly poses and relationships.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PostureObjectStitch to create synthetic anomaly images of assembled industrial parts where each component sits in its correct pose and orientation. Standard generation methods produce images that ignore these physical constraints, so the results cannot train reliable anomaly detectors for real assembly lines. The method decouples multi-view inputs into texture, high-frequency, and RGB features, modulates them over diffusion steps for progressive detail, and applies a geometric prior plus conditional loss to lock in semantic accuracy and proper positioning. If this holds, manufacturers could generate large volumes of usable training data to compensate for the rarity of real anomalies. Experiments across the MureCom dataset, the new DreamAssembly dataset, and a downstream detection task support the approach.

Core claim

PostureObjectStitch separates multi-view images into high-frequency, texture, and RGB features via condition decoupling, then adapts these features across diffusion time-steps through temporal modulation to build consistent coarse-to-fine outputs. A conditional loss strengthens key industrial elements while a geometric prior directs component placement to satisfy assembly relationships.
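As a rough intuition for the decoupling-plus-modulation pipeline, the split into coarse and high-frequency streams and the coarse-to-fine timestep weighting might look like the sketch below. This is purely illustrative NumPy: the paper's feature extractors and its time-feature modulation are learned components, and the function names and the linear schedule here are hypothetical.

```python
import numpy as np

def decouple(image):
    """Toy stand-in for condition decoupling: split an image into a
    low-frequency (coarse RGB/structure) band, a high-frequency residual,
    and a crude per-channel texture statistic. The paper's extractors are
    learned; this Gaussian/Laplacian-style split is only illustrative."""
    k = 5  # box-filter size for the low-frequency band
    pad = np.pad(image, ((k // 2, k // 2), (k // 2, k // 2), (0, 0)), mode="edge")
    low = np.zeros_like(image, dtype=float)
    for dy in range(k):
        for dx in range(k):
            low += pad[dy:dy + image.shape[0], dx:dx + image.shape[1]]
    low /= k * k
    high = image - low                        # high-frequency residual
    texture = np.abs(high).mean(axis=(0, 1))  # per-channel texture energy
    return low, high, texture

def time_modulation(t, T=1000):
    """Hypothetical timestep weighting: early (noisy) steps lean on the
    coarse band, late steps on high-frequency detail, matching the
    coarse-to-fine paradigm the paper invokes. The real modulation is
    learned, not this fixed linear schedule."""
    s = t / T                  # 1.0 = start of denoising, 0.0 = end
    return s, 1.0 - s          # (weight on coarse, weight on detail)
```

By construction the two bands sum back to the input and the two weights sum to one; the point is only the shape of the mechanism, not its parameters.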

What carries the argument

Condition decoupling of multi-view inputs into separate feature streams, combined with temporal modulation in diffusion and a geometric prior that enforces assembly relationships.

If this is right

  • The generated images can supplement limited real anomaly data to train stronger industrial detection models.
  • Progressive generation maintains multi-view consistency while adding fine details only after coarse structure is set.
  • The method is shown to outperform prior techniques on the MureCom dataset and the contributed DreamAssembly dataset.
  • Downstream anomaly detection performance improves when models are trained with the assembly-aware synthetic images.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar geometric priors could be tested in other constrained generation tasks such as robotic scene assembly or mechanical part layouts.
  • The feature decoupling step may help diffusion models in any domain where multiple input views must remain consistent with physical structure.
  • If the prior scales without heavy tuning, it offers a route to reduce manual labeling in quality-control pipelines for complex products.

Load-bearing premise

The geometric prior and conditional loss together force generated images to show correct component positions and semantics without creating new misalignments or visual artifacts.
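The abstract does not give the form of this combined objective. One plausible shape, with hypothetical weighting coefficients (the ledger below flags these as likely tuned but unspecified) and a geometric term that penalizes pose and position deviation from the assembly-specified target, is:

```python
import numpy as np

def geometric_prior_penalty(pred_pose, target_pose, pred_center, target_center):
    """Hypothetical geometric prior term: penalize deviation of a component's
    in-plane rotation (degrees) and its center position (pixels) from the
    assembly-specified target. The paper's prior operates inside the diffusion
    model; this only illustrates the shape of such a constraint."""
    rot_err = np.deg2rad(pred_pose - target_pose) ** 2
    pos_err = np.sum((np.asarray(pred_center) - np.asarray(target_center)) ** 2)
    return rot_err + pos_err

def total_loss(l_diffusion, l_conditional, l_geometric, lam_c=1.0, lam_g=1.0):
    """Composite objective: standard denoising loss plus the conditional term
    (critical industrial elements) and the geometric prior, with weighting
    coefficients lam_c and lam_g that the abstract leaves unspecified."""
    return l_diffusion + lam_c * l_conditional + lam_g * l_geometric
```

A perfectly placed component contributes zero geometric penalty, so the prior only bites when generation drifts from the assembly specification.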

What would settle it

Quantitative pose and alignment metrics, or visual checks against real assembled industrial images, showing whether generated parts are rotated or shifted relative to each other in violation of the claimed assembly rules.

Figures

Figures reproduced from arXiv: 2604.13863 by Dongpu Cao, Gang Chen, Hao Chen, Hongchang Chen, Jieming Zhang, Ying Li, Yujie Lei, Yushi Liu, Zebei Tong, Zhi Zheng.

Figure 1. Fantastic application of our proposed PostureObjectstitch in industrial anomaly generation considering assembly …
Figure 2. Overview of PostureObjectstitch. Given N reference images of a specific sample, PostureObjectstitch fine-tunes the …
Figure 3. OCR auxiliary loss.
Figure 4. Pose and orientation prior fusion.
Figure 5. DreamAssembly dataset overview. Background: the background images are collected from real industrial environments …
Figure 6. Visualization of different methods on the MureCom dataset. For each row, we present the background image with …
Figure 7. Visualization of different methods on our DreamAssembly dataset. For each row, we present the background image …
Original abstract

Image generation technology can synthesize condition-specific images to supplement real-world industrial anomaly data and enhance anomaly detection model performance. Existing generation techniques rarely account for the pose and orientation of industrial components in assembly, making the generated images difficult to utilize for downstream application. To solve this, we propose a novel image synthesis approach, called PostureObjectStitch, that achieves accurate generation to meet the requirement of industrial assembly. A condition decoupling approach is introduced to separate input multi-view images into high-frequency, texture, and RGB features. The feature temporal modulation mechanism adapts these features across diffusion model time-steps, enabling progressive generation from coarse to fine details while maintaining consistency. To ensure semantic accuracy, we introduce a conditional loss that enhances critical industrial elements and a geometric prior that guides component positioning for correct assembly relationships. Comprehensive experimental results on the MureCom dataset, our newly contributed DreamAssembly dataset, and the downstream application validate the outstanding performance of our method.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript proposes PostureObjectStitch, a diffusion-based image synthesis method for generating anomaly images of industrial components that respects assembly relationships. It decouples multi-view inputs into high-frequency, texture, and RGB features, applies feature temporal modulation across diffusion timesteps for coarse-to-fine generation, introduces a conditional loss to emphasize critical elements, and employs a geometric prior to enforce correct component positioning. The approach is evaluated on the MureCom dataset, the newly contributed DreamAssembly dataset, and a downstream anomaly detection task, with claims of superior performance over existing methods.

Significance. If the central claims are substantiated, the work addresses an important gap in synthetic data generation for industrial anomaly detection by explicitly modeling assembly poses and relationships, which prior diffusion-based approaches largely ignore. The release of the DreamAssembly dataset represents a concrete, reusable contribution that could benchmark future methods in this domain. The combination of geometric priors with conditional losses in a diffusion framework offers a technically grounded direction for controllable generation in structured scenes.

major comments (1)
  1. [Experimental Results] The central technical claim—that the geometric prior and conditional loss successfully enforce correct assembly relationships and semantic accuracy—rests on indirect evidence from downstream anomaly detection gains rather than direct quantitative validation. No metrics for pose deviation, component alignment error, overlap, or geometric fidelity on the generated DreamAssembly outputs are reported, leaving open the possibility that performance improvements arise from texture realism or other factors unrelated to the proposed priors.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the importance of modeling assembly relationships in industrial anomaly image generation. We address the single major comment below.

Point-by-point responses
  1. Referee: The central technical claim—that the geometric prior and conditional loss successfully enforce correct assembly relationships and semantic accuracy—rests on indirect evidence from downstream anomaly detection gains rather than direct quantitative validation. No metrics for pose deviation, component alignment error, overlap, or geometric fidelity on the generated DreamAssembly outputs are reported, leaving open the possibility that performance improvements arise from texture realism or other factors unrelated to the proposed priors.

    Authors: We agree that direct quantitative metrics would provide stronger and more isolated evidence for the contribution of the geometric prior and conditional loss. The current evaluation relies on downstream anomaly detection performance on DreamAssembly (plus qualitative results), which demonstrates practical utility but does not directly quantify geometric fidelity. In the revised manuscript we will add explicit metrics on the generated DreamAssembly outputs, including pose deviation, component alignment error, and overlap ratios, computed by comparing synthesized assemblies against the known ground-truth configurations provided in the dataset. These will be reported alongside the existing downstream results to better separate the effect of the proposed priors from general improvements in texture or realism. revision: yes
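The metrics the rebuttal promises are straightforward to compute once component poses, centers, and bounding boxes are annotated against the dataset's ground-truth configurations. A minimal sketch, with definitions assumed since the authors do not specify theirs (in particular, "overlap ratio" is read here as box IoU):

```python
import numpy as np

def pose_deviation(pred_deg, gt_deg):
    """Smallest absolute angular difference in degrees (wraps at 360)."""
    d = abs(pred_deg - gt_deg) % 360.0
    return min(d, 360.0 - d)

def alignment_error(pred_center, gt_center):
    """Euclidean distance between predicted and ground-truth component centers."""
    return float(np.linalg.norm(np.asarray(pred_center) - np.asarray(gt_center)))

def overlap_ratio(box_a, box_b):
    """IoU of two axis-aligned boxes (x1, y1, x2, y2) -- one plausible reading
    of 'overlap ratio'; the exact definition is the authors' to pin down."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0
```

Reporting distributions of these three quantities over generated DreamAssembly outputs would directly isolate the geometric prior's contribution from texture realism.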

Circularity Check

0 steps flagged

No significant circularity; novel components added to standard diffusion models

full rationale

The paper proposes independent additions (condition decoupling, feature temporal modulation, conditional loss, geometric prior) to diffusion models and contributes a new DreamAssembly dataset. These are described as new mechanisms for enforcing assembly relationships and semantic accuracy rather than being defined in terms of the outputs they produce or fitted to the target results by construction. Experimental validation on MureCom, DreamAssembly, and downstream tasks is presented without any quoted reduction of predictions to inputs, self-citation chains, or ansatz smuggling. The derivation chain is self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

Based on abstract only; central claim rests on standard diffusion model assumptions plus newly introduced geometric prior and conditional loss whose effectiveness is asserted but not detailed.

free parameters (1)
  • loss weighting coefficients
    Weights balancing conditional loss and geometric prior are likely tuned but unspecified in abstract.
axioms (1)
  • domain assumption: Diffusion models conditioned on decoupled multi-view features can produce consistent progressive generation from coarse to fine while preserving assembly semantics.
    Invoked to justify the feature temporal modulation mechanism.
invented entities (1)
  • geometric prior (no independent evidence)
    purpose: Guides component positioning to maintain correct assembly relationships.
    Newly introduced to ensure semantic accuracy in generated images.

pith-pipeline@v0.9.0 · 5487 in / 1326 out tokens · 62400 ms · 2026-05-10T14:13:25.247549+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

55 extracted references · 50 canonical work pages · 7 internal anchors

  1. [1]

    Yuval Alaluf, Elad Richardson, Gal Metzer, and Daniel Cohen-Or. 2023. A Neural Space-Time Representation for Text-to-Image Personalization. arXiv:2305.15391 [cs.CV] https://arxiv.org/abs/2305.15391

  2. [2]

    James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, Wesam Manassra, Prafulla Dhariwal, Casey Chu, Yunxin Jiao, and Aditya Ramesh. [n. d.]. Improving Image Generation with Better Captions. https://api.semanticscholar.org/CorpusID:264403242

  3. [3]

    Tim Brooks, Aleksander Holynski, and Alexei A. Efros. 2023. InstructPix2Pix: Learning to Follow Image Editing Instructions. arXiv:2211.09800 [cs.CV] https://arxiv.org/abs/2211.09800

  4. [4]

    Jiaxuan Chen, Bo Zhang, Qingdong He, Jinlong Peng, and Li Niu. 2025. MureObjectStitch: Multi-reference Image Composition. arXiv:2411.07462 [cs.CV] https://arxiv.org/abs/2411.07462

  5. [5]

    Xi Chen, Lianghua Huang, Yu Liu, Yujun Shen, Deli Zhao, and Hengshuang Zhao. 2024. AnyDoor: Zero-shot Object-level Image Customization. arXiv:2307.09481 [cs.CV] https://arxiv.org/abs/2307.09481

  6. [6]

    Songmin Dai, Yifan Wu, Xiaoqiang Li, and Xiangyang Xue. 2023. Generating and Reweighting Dense Contrastive Patterns for Unsupervised Anomaly Detection. arXiv:2312.15911 [cs.CV] https://arxiv.org/abs/2312.15911

  7. [7]

    Zhewei Dai, Shilei Zeng, Haotian Liu, Xurui Li, Feng Xue, and Yu Zhou. 2025. SeaS: Few-shot Industrial Anomaly Image Generation with Separation and Sharing Fine-tuning. arXiv:2410.14987 [cs.CV] https://arxiv.org/abs/2410.14987

  8. [8]

    N. Dalal and B. Triggs. 2005. Histograms of oriented gradients for human detection. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), Vol. 1. 886–893. doi:10.1109/CVPR.2005.177

  9. [9]

    Dale Decatur, Thibault Groueix, Wang Yifan, Rana Hanocka, Vladimir Kim, and Matheus Gadelha. 2025. Reusing Computation in Text-to-Image Diffusion for Efficient Generation of Image Sets. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 16482–16491

  10. [10]

    Ziyi Dong, Pengxu Wei, and Liang Lin. 2025. DreamArtist++: Controllable One-Shot Text-to-Image Generation via Positive-Negative Adapter. arXiv:2211.11337 [cs.CV] https://arxiv.org/abs/2211.11337

  11. [11]

    Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H. Bermano, Gal Chechik, and Daniel Cohen-Or. 2022. An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion. arXiv:2208.01618 [cs.CV] https://arxiv.org/abs/2208.01618

  12. [12]

    Yuchao Gu, Xintao Wang, Jay Zhangjie Wu, Yujun Shi, Yunpeng Chen, Zihan Fan, Wuyou Xiao, Rui Zhao, Shuning Chang, Weijia Wu, Yixiao Ge, Ying Shan, and Mike Zheng Shou. 2023. Mix-of-Show: Decentralized Low-Rank Adaptation for Multi-Concept Customization of Diffusion Models. arXiv:2305.18292 [cs.CV] https://arxiv.org/abs/2305.18292

  13. [13]

    Teng Hu, Jiangning Zhang, Ran Yi, Yuzhen Du, Xu Chen, Liang Liu, Yabiao Wang, and Chengjie Wang. 2024. AnomalyDiffusion: Few-Shot Anomaly Image Generation with Diffusion Model. arXiv:2312.05767 [cs.CV] https://arxiv.org/abs/2312.05767

  14. [14]

    Zongxiang Hu and Zhaosheng Zhang. 2024. SOWA: Adapting Hierarchical Frozen Window Self-Attention to Visual-Language Models for Better Anomaly Detection. arXiv:2407.03634 [cs.CV] https://arxiv.org/abs/2407.03634

  15. [15]

    Zhe Kong, Yong Zhang, Tianyu Yang, Tao Wang, Kaihao Zhang, Bizhu Wu, Guanying Chen, Wei Liu, and Wenhan Luo. 2024. OMG: Occlusion-friendly Personalized Multi-concept Generation in Diffusion Models. arXiv:2403.10983 [cs.CV] https://arxiv.org/abs/2403.10983

  16. [16]

    Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. 2023. Multi-Concept Customization of Text-to-Image Diffusion. arXiv:2212.04488 [cs.CV] https://arxiv.org/abs/2212.04488

  17. [17]

    Dongxu Li, Junnan Li, and Steven C. H. Hoi. 2023. BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing. arXiv:2305.14720 [cs.CV] https://arxiv.org/abs/2305.14720

  18. [18]

    Tianle Li, Max Ku, Cong Wei, and Wenhu Chen. 2023. DreamEdit: Subject-driven Image Editing. arXiv:2306.12624 [cs.CV] https://arxiv.org/abs/2306.12624

  19. [19]

    Yuanwei Li, Elizaveta Ivanova, and Martins Bruveris. 2024. FADE: Few-shot/zero-shot Anomaly Detection Engine using Large Vision-Language Model. arXiv:2409.00556 [cs.CV] https://arxiv.org/abs/2409.00556

  20. [20]

    Yaowei Li, Xiaoyu Li, Zhaoyang Zhang, Yuxuan Bian, Gan Liu, Xinyuan Li, Jiale Xu, Wenbo Hu, Yating Liu, Lingen Li, Jing Cai, Yuexian Zou, Yancheng He, and Ying Shan. 2025. IC-Custom: Diverse Image Customization via In-Context Learning. arXiv:2507.01926 [cs.CV] https://arxiv.org/abs/2507.01926

  21. [22]

    Jianxiang Lu, Cong Xie, and Hui Guo. 2024. Object-Driven One-Shot Fine-tuning of Text-to-Image Diffusion with Prototypical Embedding. arXiv:2401.15708 [cs.CV] https://arxiv.org/abs/2401.15708

  22. [23]

    Lingxiao Lu, Jiangtong Li, Bo Zhang, and Li Niu. 2024. DreamCom: Finetuning Text-guided Inpainting Model for Image Composition. arXiv:2309.15508 [cs.CV] https://arxiv.org/abs/2309.15508

  23. [24]

    Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. 2022. RePaint: Inpainting using Denoising Diffusion Probabilistic Models. arXiv:2201.09865 [cs.CV] https://arxiv.org/abs/2201.09865

  24. [25]

    Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. 2022. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. arXiv:2112.10741 [cs.CV] https://arxiv.org/abs/2112.10741

  25. [26]

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. 2023. SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis. arXiv:2307.01952 [cs.CV] https://arxiv.org/abs/2307.01952

  26. [27]

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. arXiv:2103.00020 [cs.CV] https://arxiv.org/abs/2103.00020

  27. [28]

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. 2022. Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv:2204.06125 [cs.CV] https://arxiv.org/abs/2204.06125

  29. [30]

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-Resolution Image Synthesis with Latent Diffusion Models. arXiv:2112.10752 [cs.CV] https://arxiv.org/abs/2112.10752

  30. [31]

    Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. 2023. DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation. arXiv:2208.12242 [cs.CV] https://arxiv.org/abs/2208.12242

  31. [32]

    Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Wei Wei, Tingbo Hou, Yael Pritch, Neal Wadhwa, Michael Rubinstein, and Kfir Aberman. 2024. HyperDreamBooth: HyperNetworks for Fast Personalization of Text-to-Image Models. arXiv:2307.06949 [cs.CV] https://arxiv.org/abs/2307.06949

  32. [33]

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Raphael Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. 2022. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. arXiv:2205.11487 [cs.CV] https:/...

  33. [34]

    Kihyuk Sohn, Nataniel Ruiz, Kimin Lee, Daniel Castro Chin, Irina Blok, Huiwen Chang, Jarred Barber, Lu Jiang, Glenn Entis, Yuanzhen Li, Yuan Hao, Irfan Essa, Michael Rubinstein, and Dilip Krishnan. 2023. StyleDrop: Text-to-Image Generation in Any Style. arXiv:2306.00983 [cs.CV] https://arxiv.org/abs/2306.00983

  34. [35]

    Yizhi Song, Zhifei Zhang, Zhe Lin, Scott Cohen, Brian Price, Jianming Zhang, Soo Ye Kim, and Daniel Aliaga. 2022. ObjectStitch: Generative Object Compositing. arXiv:2212.00932 [cs.CV] https://arxiv.org/abs/2212.00932

  35. [36]

    Mingyu Sung, Il-Min Kim, Sangseok Yun, and Jae-Mo Kang. 2025. H2-Cache: A Novel Hierarchical Dual-Stage Cache for High-Performance Acceleration of Generative Diffusion Models. arXiv:2510.27171 [cs.CV] https://arxiv.org/abs/2510.27171

  36. [37]

    Yoad Tewel, Rinon Gal, Gal Chechik, and Yuval Atzmon. 2024. Key-Locked Rank One Editing for Text-to-Image Personalization. arXiv:2305.01644 [cs.CV] https://arxiv.org/abs/2305.01644

  37. [38]

    Anton Voronov, Mikhail Khoroshikh, Artem Babenko, and Max Ryabinin. 2023. Is This Loss Informative? Faster Text-to-Image Customization by Tracking Objective Dynamics. arXiv:2302.04841 [cs.CV] https://arxiv.org/abs/2302.04841

  38. [39]

    Andrey Voynov, Qinghao Chu, Daniel Cohen-Or, and Kfir Aberman. 2023. P+: Extended Textual Conditioning in Text-to-Image Generation. arXiv:2303.09522 [cs.CV] https://arxiv.org/abs/2303.09522

  40. [41]

    Qixun Wang, Xu Bai, Haofan Wang, Zekui Qin, Anthony Chen, Huaxia Li, Xu Tang, and Yao Hu. 2024. InstantID: Zero-shot Identity-Preserving Generation in Seconds. arXiv:2401.07519 [cs.CV] https://arxiv.org/abs/2401.07519

  41. [42]

    Zhou Wang, A.C. Bovik, H.R. Sheikh, and E.P. Simoncelli. 2004. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13, 4 (2004), 600–612. doi:10.1109/TIP.2003.819861

  42. [43]

    Zhonghao Wang, Wei Wei, Yang Zhao, Zhisheng Xiao, Mark Hasegawa-Johnson, Humphrey Shi, and Tingbo Hou. 2023. HiFi Tuner: High-Fidelity Subject-Driven Fine-Tuning for Diffusion Models. arXiv:2312.00079 [cs.CV] https://arxiv.org/abs/2312.00079

  43. [44]

    Felix Wimbauer, Bichen Wu, Edgar Schoenfeld, Xiaoliang Dai, Ji Hou, Zijian He, Artsiom Sanakoyeu, Peizhao Zhang, Sam Tsai, Jonas Kohler, Christian Rupprecht, Daniel Cremers, Peter Vajda, and Jialiang Wang. 2024. Cache Me if You Can: Accelerating Diffusion Models through Block Caching. arXiv:2312.03209 [cs.CV] https://arxiv.org/abs/2312.03209

  44. [45]

    Chendong Xiang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. 2023. A Closer Look at Parameter-Efficient Tuning in Diffusion Models. arXiv:2303.18181 [cs.CV] https://arxiv.org/abs/2303.18181

  45. [46]

    Jinqi Xiao, Miao Yin, Yu Gong, Xiao Zang, Jian Ren, and Bo Yuan. 2023. COMCAT: Towards Efficient Compression and Customization of Attention-Based Vision Models. arXiv:2305.17235 [cs.CV] https://arxiv.org/abs/2305.17235

  46. [47]

    Yuhao Xu, Tao Gu, Weifeng Chen, and Chengcai Chen. 2024. OOTDiffusion: Outfitting Fusion based Latent Diffusion for Controllable Virtual Try-on. arXiv:2403.01779 [cs.CV] https://arxiv.org/abs/2403.01779

  47. [48]

    Binxin Yang, Shuyang Gu, Bo Zhang, Ting Zhang, Xuejin Chen, Xiaoyan Sun, Dong Chen, and Fang Wen. 2022. Paint by Example: Exemplar-based Image Editing with Diffusion Models. arXiv:2211.13227 [cs.CV] https://arxiv.org/abs/2211.13227

  48. [49]

    Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. 2023. IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models. arXiv:2308.06721 [cs.CV] https://arxiv.org/abs/2308.06721

  49. [50]

    Ge Yuan, Xiaodong Cun, Yong Zhang, Maomao Li, Chenyang Qi, Xintao Wang, Ying Shan, and Huicheng Zheng. 2023. Inserting Anybody in Diffusion Models via Celeb Basis. arXiv:2306.00926 [cs.CV] https://arxiv.org/abs/2306.00926

  50. [51]

    Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. 2023. Adding Conditional Control to Text-to-Image Diffusion Models. arXiv:2302.05543 [cs.CV] https://arxiv.org/abs/2302.05543

  51. [52]

    Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. 2018. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. arXiv:1801.03924 [cs.CV] https://arxiv.org/abs/1801.03924

  53. [54]

    Yiheng Zhang, Yunkang Cao, Xiaohao Xu, and Weiming Shen. 2024. LogiCode: an LLM-Driven Framework for Logical Anomaly Detection. arXiv:2406.04687 [cs.LG] https://arxiv.org/abs/2406.04687

  54. [55]

    Zibo Zhao, Zeqiang Lai, Qingxiang Lin, Yunfei Zhao, Haolin Liu, Shuhui Yang, Yifei Feng, Mingxin Yang, Sheng Zhang, Xianghui Yang, et al. 2025. Hunyuan3D 2.0: Scaling diffusion models for high resolution textured 3D assets generation. arXiv preprint arXiv:2501.12202 (2025)

  55. [56]

    Wenbing Zhu, Lidong Wang, Ziqing Zhou, Chengjie Wang, Yurui Pan, Ruoyi Zhang, Zhuhao Chen, Linjie Cheng, Bin-Bin Gao, Jiangning Zhang, Zhenye Gan, Yuxie Wang, Yulong Chen, Shuguang Qian, Mingmin Chi, Bo Peng, and Lizhuang Ma. 2025. Real-IAD D3: A Real-World 2D/Pseudo-3D/3D Dataset for Industrial Anomaly Detection. arXiv:2504.14221 [cs.CV] https://arxiv.or...