Large-Scale Universal Defect Generation: Foundation Models and Datasets
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-10 17:45 UTC · model grok-4.3
The pith
UniDG uses a 300K-pair dataset and MM-DiT fusion to generate realistic defects from references or text instructions without per-category fine-tuning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
UniDG is a universal defect generation model that supports both reference-based generation and text instruction-based editing without per-category fine-tuning. It achieves this via Defect-Context Editing with adaptive defect cropping and structured diptych inputs, fuses reference and target conditions through MM-DiT multimodal attention, and applies a two-stage training strategy of Diversity-SFT followed by Consistency-RFT. Experiments show it outperforms prior few-shot anomaly generation and image insertion baselines in synthesis quality and in single- and multi-class anomaly detection and localization on MVTec-AD and VisA.
What carries the argument
Defect-Context Editing mechanism that uses adaptive cropping and diptych input format, combined with MM-DiT multimodal attention for fusing reference and target conditions, trained in two stages on the UDG dataset.
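The review does not spell out the exact cropping policy or diptych layout, so the sketch below is only one plausible reading of "adaptive defect cropping" and "structured diptych input": a mask-driven crop around the reference defect (with an assumed relative margin) placed side by side with the target image as a single conditioning canvas. The margin value and zero-padding are hypothetical choices, not details from the paper.

```python
import numpy as np

def adaptive_crop(image: np.ndarray, mask: np.ndarray, margin: float = 0.5) -> np.ndarray:
    """Crop around the defect mask's bounding box, expanded by a relative margin.

    Hypothetical illustration of "adaptive defect cropping"; the paper's exact
    policy (margin schedule, aspect handling) is not specified in this review.
    Assumes a non-empty binary mask.
    """
    ys, xs = np.nonzero(mask)
    y0, y1 = ys.min(), ys.max() + 1
    x0, x1 = xs.min(), xs.max() + 1
    h, w = y1 - y0, x1 - x0
    pad_y, pad_x = int(h * margin), int(w * margin)
    y0, y1 = max(0, y0 - pad_y), min(image.shape[0], y1 + pad_y)
    x0, x1 = max(0, x0 - pad_x), min(image.shape[1], x1 + pad_x)
    return image[y0:y1, x0:x1]

def make_diptych(reference_crop: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Place the reference crop and the target side by side as one image."""
    h = max(reference_crop.shape[0], target.shape[0])

    def pad_to(img: np.ndarray, height: int) -> np.ndarray:
        out = np.zeros((height, img.shape[1], img.shape[2]), dtype=img.dtype)
        out[: img.shape[0]] = img
        return out

    return np.concatenate([pad_to(reference_crop, h), pad_to(target, h)], axis=1)
```

In this reading, the diptych gives the generator joint spatial context: the left panel carries the defect to transfer, the right panel the normal target to edit.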
If this is right
- Supplies synthetic training data for single-class and multi-class anomaly detection and localization without retraining per defect category.
- Improves image synthesis quality over existing few-shot generation and editing methods.
- Supports both copying defects from reference images and following natural-language editing instructions.
- Handles wide variation in defect scale and morphology while preserving category consistency.
- Provides a large paired dataset that can serve as a foundation for further defect-related model development.
Where Pith is reading between the lines
- The method could lower the cost of collecting real defect samples for training industrial inspection systems.
- Similar two-stage diversity-then-consistency training might transfer to generating other rare visual events such as medical lesions.
- If the model truly generalizes across domains, it could reduce the need for domain-specific fine-tuning in broader image-editing applications.
- Cross-testing the generated defects on additional industrial datasets would reveal whether the reported gains hold outside MVTec-AD and VisA.
Load-bearing premise
That defects produced by the two-stage training and MM-DiT fusion stay realistic, diverse, and category-consistent enough to improve downstream anomaly detection without adding artifacts or biases.
What would settle it
Train anomaly detectors on UniDG synthetics and measure whether they achieve comparable or higher detection and localization scores than detectors trained on real defects or prior synthetic methods, evaluated on the same held-out MVTec-AD and VisA test sets.
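That settling experiment reduces to comparing two detectors' scores on one held-out test set. A minimal sketch of the comparison step, assuming per-sample anomaly scores are already available (detector training is elided, and the two score vectors below are random stand-ins, not results): AUROC computed from ranks via the Mann-Whitney U identity.

```python
import numpy as np

def auroc(scores: np.ndarray, labels: np.ndarray) -> float:
    """AUROC via the rank-sum (Mann-Whitney U) identity; labels in {0, 1}."""
    order = np.argsort(scores)
    ranks = np.empty_like(order, dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    u = ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)

# Hypothetical protocol: one detector trained on UniDG synthetics, one on real
# defects, both scored on the same held-out set. Random stand-in scores here.
rng = np.random.default_rng(0)
labels = np.array([0] * 50 + [1] * 50)
scores_synthetic = rng.normal(labels * 1.0, 1.0)
scores_real = rng.normal(labels * 1.2, 1.0)
print(auroc(scores_synthetic, labels), auroc(scores_real, labels))
```

The claim is settled in UniDG's favor if the synthetic-trained detector's AUROC matches or exceeds the real-trained one's on the same split.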
Original abstract
Existing defect/anomaly generation methods often rely on few-shot learning, which overfits to specific defect categories due to the lack of large-scale paired defect editing data. This issue is aggravated by substantial variations in defect scale and morphology, resulting in limited generalization, degraded realism, and category consistency. We address these challenges by introducing UDG, a large-scale dataset of 300K normal-abnormal-mask-caption quadruplets spanning diverse domains, and by presenting UniDG, a universal defect generation foundation model that supports both reference-based defect generation and text instruction-based defect editing without per-category fine-tuning. UniDG performs Defect-Context Editing via adaptive defect cropping and structured diptych input format, and fuses reference and target conditions through MM-DiT multimodal attention. A two-stage training strategy, Diversity-SFT followed by Consistency-RFT, further improves diversity while enhancing realism and reference consistency. Extensive experiments on MVTec-AD and VisA show that UniDG outperforms prior few-shot anomaly generation and image insertion/editing baselines in synthesis quality and downstream single- and multi-class anomaly detection/localization. Code will be available at https://github.com/RetoFan233/UniDG.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces UDG, a dataset of 300K normal-abnormal-mask-caption quadruplets across domains, and UniDG, a foundation model for universal defect generation supporting reference-based and text-instruction-based editing without per-category fine-tuning. The method uses Defect-Context Editing with adaptive cropping and diptych inputs, MM-DiT multimodal attention for condition fusion, and a two-stage training schedule (Diversity-SFT followed by Consistency-RFT). Experiments claim superior synthesis quality and improved single- and multi-class anomaly detection/localization performance over few-shot baselines on MVTec-AD and VisA.
Significance. If the empirical results hold under rigorous verification, this work could meaningfully advance industrial anomaly detection by providing scalable, generalizable synthetic defect data that reduces overfitting issues in few-shot approaches. The large-scale dataset and foundation-model framing with explicit two-stage training for diversity and consistency are notable strengths that could influence future data-generation pipelines.
major comments (2)
- [§5] §5 (Experiments): The reported outperformance on MVTec-AD and VisA for both synthesis quality and downstream detection/localization lacks accompanying details on run counts, standard deviations, or statistical significance tests. Without these, it is difficult to assess whether the gains are robust or could be explained by evaluation variance or implementation differences in baselines.
- [§4.2] §4.2 (MM-DiT fusion): The multimodal attention mechanism for fusing reference and target conditions is load-bearing for the claimed generalization, yet the manuscript provides insufficient architectural specifics (e.g., exact attention masking, conditioning injection points, or parameter counts) to allow reproduction or to verify that the fusion indeed preserves category consistency across scales and morphologies.
minor comments (2)
- [Abstract] The abstract states that code will be available at a GitHub link, but the manuscript should explicitly confirm dataset release details (e.g., access method, licensing) to support the large-scale claim.
- [§3] Figure captions and §3 (dataset construction) would benefit from additional quantitative statistics on defect scale/morphology distributions to substantiate the diversity assertions.
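The statistics the minor comment asks for are straightforward to compute from the dataset's binary masks. A minimal sketch, with assumed definitions (scale = mask area fraction, morphology proxy = bounding-box elongation); the paper may report different or richer statistics.

```python
import numpy as np

def defect_stats(masks: list) -> dict:
    """Scale and elongation summaries from a list of binary defect masks.

    Hypothetical sketch of the distributional statistics the review requests:
    area fraction captures defect scale, bounding-box aspect ratio is a crude
    morphology proxy. Quantiles chosen for illustration only.
    """
    areas, aspects = [], []
    for m in masks:
        ys, xs = np.nonzero(m)
        if len(ys) == 0:
            continue  # skip empty masks
        areas.append(len(ys) / m.size)
        h = ys.max() - ys.min() + 1
        w = xs.max() - xs.min() + 1
        aspects.append(max(h, w) / min(h, w))
    return {
        "area_fraction_quantiles": np.quantile(areas, [0.1, 0.5, 0.9]).tolist(),
        "aspect_ratio_quantiles": np.quantile(aspects, [0.1, 0.5, 0.9]).tolist(),
    }
```

Reporting such quantiles per domain would directly substantiate the "substantial variations in defect scale and morphology" claim.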
Simulated Author's Rebuttal
We thank the referee for the positive evaluation, the recommendation for minor revision, and the constructive comments on experimental rigor and architectural reproducibility. We address each major comment below and will incorporate the suggested improvements in the revised manuscript.
Point-by-point responses
Referee: [§5] §5 (Experiments): The reported outperformance on MVTec-AD and VisA for both synthesis quality and downstream detection/localization lacks accompanying details on run counts, standard deviations, or statistical significance tests. Without these, it is difficult to assess whether the gains are robust or could be explained by evaluation variance or implementation differences in baselines.
Authors: We agree that additional statistical details would strengthen the empirical claims. The reported results were obtained from single training and evaluation runs due to the substantial computational cost of training the foundation model. In the revised manuscript we will re-run the key experiments on MVTec-AD and VisA using three independent random seeds, report mean and standard deviation for all synthesis-quality and anomaly-detection metrics, and include paired t-tests (or Wilcoxon tests where appropriate) against the strongest baselines to establish statistical significance. revision: yes
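The promised seed-level statistics are simple to produce once per-seed metrics exist. A stdlib-only sketch of the mean±std and paired t statistic the rebuttal commits to; the metric values below are hypothetical placeholders, not results from the paper.

```python
import math
from statistics import mean, stdev

def paired_t(a: list, b: list) -> tuple:
    """Mean and std of per-seed differences, plus the paired t statistic.

    t = mean(d) / (std(d) / sqrt(n)) for differences d over n seeds.
    Requires n >= 2 and non-identical differences.
    """
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    return mean(d), stdev(d), mean(d) / (stdev(d) / math.sqrt(n))

# Hypothetical per-seed image-AUROC for UniDG vs. the strongest baseline.
unidg = [0.981, 0.979, 0.984]
baseline = [0.962, 0.968, 0.960]
print(paired_t(unidg, baseline))
```

With only three seeds the t test has low power, which is why the rebuttal's fallback to a Wilcoxon test (or simply reporting mean±std) is a reasonable design choice.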
Referee: [§4.2] §4.2 (MM-DiT fusion): The multimodal attention mechanism for fusing reference and target conditions is load-bearing for the claimed generalization, yet the manuscript provides insufficient architectural specifics (e.g., exact attention masking, conditioning injection points, or parameter counts) to allow reproduction or to verify that the fusion indeed preserves category consistency across scales and morphologies.
Authors: We acknowledge that the current description of MM-DiT is high-level. In the revision we will expand Section 4.2 with: (i) the precise attention-masking pattern used to prevent cross-contamination between reference and target tokens, (ii) the exact DiT block indices and layer types where reference and text embeddings are injected, (iii) the parameter counts of the multimodal attention modules, and (iv) a short ablation showing that the chosen fusion strategy maintains category consistency across defect scales. We will also release the full model configuration file together with the code. revision: yes
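The masking detail promised in (i) can be illustrated generically. The sketch below shows single-head attention over concatenated reference and target tokens with a block mask; the specific pattern (reference tokens blocked from attending to target tokens) is a hypothetical choice for illustration, not the paper's documented MM-DiT design.

```python
import numpy as np

def masked_joint_attention(q, k, v, mask):
    """Single-head scaled dot-product attention with a boolean allow-mask.

    mask[i, j] == False blocks query token i from attending to key token j.
    Illustrative only: the actual MM-DiT masking pattern is what the revision
    promises to document.
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)
    logits = np.where(mask, logits, -1e9)  # blocked entries get -inf-like logits
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

n_ref, n_tar = 4, 6                        # token counts (hypothetical)
n = n_ref + n_tar
mask = np.ones((n, n), dtype=bool)
mask[:n_ref, n_ref:] = False               # reference tokens cannot see target

rng = np.random.default_rng(0)
q = rng.normal(size=(n, 8))
k = rng.normal(size=(n, 8))
v = rng.normal(size=(n, 8))
out = masked_joint_attention(q, k, v, mask)
```

Under this mask, reference-token outputs are provably independent of target-token values, which is one concrete way to state the "no cross-contamination" property the rebuttal mentions.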
Circularity Check
No significant circularity detected
full rationale
The paper's core contribution is a new 300K quadruplet dataset (UDG) and a foundation model (UniDG) trained via Defect-Context Editing, MM-DiT fusion, and a two-stage Diversity-SFT + Consistency-RFT schedule. All performance claims are supported by direct empirical comparisons on external, standard benchmarks (MVTec-AD, VisA) against prior few-shot baselines, with no load-bearing equations, fitted parameters renamed as predictions, or self-citation chains that reduce the central result to its own inputs by construction. The derivation chain remains self-contained and externally falsifiable.
Axiom & Free-Parameter Ledger
free parameters (1)
- training hyperparameters for Diversity-SFT and Consistency-RFT stages
axioms (2)
- domain assumption Defect-Context Editing via adaptive cropping and diptych format preserves sufficient context for realistic and consistent defect insertion
- domain assumption MM-DiT multimodal attention effectively fuses reference defect and target image conditions
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · match: unclear
  "UniDG performs Defect-Context Editing via adaptive defect cropping and structured diptych input format, and fuses reference and target conditions through MM-DiT multimodal attention. A two-stage training strategy, Diversity-SFT followed by Consistency-RFT..."
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · match: unclear
  "We adopt a two-stage training strategy for universal defect generation: Diversity-SFT followed by Consistency-RFT... construct two reward models (Defect-Und-Reward and Defect-Recog-Reward) and build an online RFT pipeline... based on Flow-GRPO"
Reference graph
Works this paper leans on
- [1] Bai, S., Cai, Y., Chen, R., et al. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631.
- [2] Cai, H., Cao, S., Du, R., Gao, P., Hoi, S., Huang, S., Hou, Z., Jiang, D., Jin, X., Li, L., et al. Z-Image: An efficient image generation foundation model with single-stream diffusion transformer. arXiv preprint arXiv:2511.22699.
- [3] Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261.
- [4] Gal, R., Alaluf, Y., Atzmon, Y., Patashnik, O., Bermano, A. H., Chechik, G., and Cohen-Or, D. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618.
- [5] He, R., Ma, K., Huang, L., Huang, S., Gao, J., Wei, X., Dai, J., Han, J., and Liu, S. FreeEdit: Mask-free reference-based image editing with multi-modal instruction. arXiv preprint arXiv:2409.18071.
- [6] Liu, X., Gong, C., and Liu, Q. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003.
- [7] Mishchenko, K. and Defazio, A. Prodigy: An expeditiously adaptive parameter-free learner. arXiv preprint arXiv:2306.06101.
- [8] Ravi, N., Gabeur, V., Hu, Y.-T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., et al. SAM 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714.
- [9] Song, J., Park, D., Baek, K., Lee, S., Choi, J., Kim, E., and Yoon, S. DefectFill: Realistic defect generation with inpainting diffusion model for visual inspection. In CVPR, 2025a. Song, W., Jiang, H., Yang, Z., Quan, R., and Yang, Y. Insert anything: Image insertion via in-context editing in DiT. In AAAI, 2025b. Tan, X., Liu, J., Fan, Y., Gao, B.-B., J...
- [10] Ye, H., Zhang, J., Liu, S., Han, X., and Yang, W. IP-Adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721.
discussion (0)