pith. machine review for the scientific record.

arxiv: 2604.08915 · v1 · submitted 2026-04-10 · 💻 cs.CV · cs.AI

Recognition: 2 Lean theorem links

Large-Scale Universal Defect Generation: Foundation Models and Datasets

Bin-Bin Gao, Chengjie Wang, Jiawei Zhan, Jun Liu, Xiaochen Chen, Yuanting Fan, Yuhuan Lin, Zhewei Dai

Pith reviewed 2026-05-10 17:45 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords defect generation · anomaly detection · foundation model · image editing · synthetic dataset · MVTec-AD · universal model · multimodal diffusion

The pith

UniDG uses a 300K-quadruplet dataset and MM-DiT fusion to generate realistic defects from reference images or text instructions without per-category fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents UDG, a dataset of 300,000 normal-abnormal-mask-caption quadruplets spanning diverse domains, and UniDG, a foundation model for universal defect generation. UniDG performs reference-based defect insertion or text-guided editing through adaptive cropping, diptych formatting, and multimodal attention fusion. A two-stage training process first maximizes diversity, then enforces consistency and realism. If the claim holds, synthetic defects can train anomaly detectors that generalize better to real industrial images on benchmarks like MVTec-AD and VisA, reducing reliance on scarce real defect examples.

Core claim

UniDG is a universal defect generation model that supports both reference-based generation and text instruction-based editing without per-category fine-tuning. It achieves this via Defect-Context Editing with adaptive defect cropping and structured diptych inputs, fuses reference and target conditions through MM-DiT multimodal attention, and applies a two-stage training strategy of Diversity-SFT followed by Consistency-RFT. Experiments show it outperforms prior few-shot anomaly generation and image insertion baselines in synthesis quality and in single- and multi-class anomaly detection and localization on MVTec-AD and VisA.

What carries the argument

A Defect-Context Editing mechanism that uses adaptive cropping and a diptych input format, combined with MM-DiT multimodal attention to fuse reference and target conditions, trained in two stages on the UDG dataset.
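The cropping-and-diptych step above can be sketched concretely. This is a minimal toy illustration of the described input format, not the paper's released code; the margin heuristic, function names, and data are assumptions.

```python
# Hedged sketch of Defect-Context Editing's input preparation: an adaptive
# crop around the reference defect mask, then a "diptych" that places the
# cropped reference beside the target image for joint attention.

def adaptive_crop(mask, margin=0.2):
    """Bounding box around the defect mask, expanded by a relative margin."""
    ys = [y for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    xs = [x for row in mask for x, v in enumerate(row) if v]
    y0, y1, x0, x1 = min(ys), max(ys), min(xs), max(xs)
    dy = max(1, int((y1 - y0 + 1) * margin))
    dx = max(1, int((x1 - x0 + 1) * margin))
    h, w = len(mask), len(mask[0])
    return (max(0, y0 - dy), min(h, y1 + 1 + dy),
            max(0, x0 - dx), min(w, x1 + 1 + dx))

def diptych(reference, target):
    """Concatenate reference crop and target side by side (same height)."""
    assert len(reference) == len(target)
    return [r_row + t_row for r_row, t_row in zip(reference, target)]

# Toy 4x4 single-channel "images" as nested lists.
mask = [[0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 1, 0, 0],
        [0, 0, 0, 0]]
ref = [[9] * 4 for _ in range(4)]   # stand-in reference defect image
tgt = [[1] * 4 for _ in range(4)]   # stand-in normal target image

y0, y1, x0, x1 = adaptive_crop(mask)
ref_crop = [row[x0:x1] for row in ref[y0:y1]]  # margin covers the full toy image here
pair = diptych(ref_crop, tgt)
print(len(pair), len(pair[0]))  # 4 rows, 8 columns
```

The diptych lets a single attention pass see reference and target tokens in one canvas, which is one plausible reason the method avoids per-category fine-tuning.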

If this is right

  • Enables single-class and multi-class anomaly detection and localization without retraining per defect category.
  • Improves image synthesis quality over existing few-shot generation and editing methods.
  • Supports both copying defects from reference images and following natural-language editing instructions.
  • Handles wide variation in defect scale and morphology while preserving category consistency.
  • Provides a large paired dataset that can serve as a foundation for further defect-related model development.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could lower the cost of collecting real defect samples for training industrial inspection systems.
  • Similar two-stage diversity-then-consistency training might transfer to generating other rare visual events such as medical lesions.
  • If the model truly generalizes across domains, it could reduce the need for domain-specific fine-tuning in broader image-editing applications.
  • Cross-testing the generated defects on additional industrial datasets would reveal whether the reported gains hold outside MVTec-AD and VisA.

Load-bearing premise

That defects produced by the two-stage training and MM-DiT fusion stay realistic, diverse, and category-consistent enough to improve downstream anomaly detection without adding artifacts or biases.

What would settle it

Train anomaly detectors on UniDG synthetics and measure whether they match or exceed the detection and localization scores of detectors trained on real defects or prior synthetic methods when evaluated on the same held-out MVTec-AD and VisA test sets.
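The settling experiment can be sketched as a small protocol, assuming image-level AUROC as the headline metric. The AUROC helper is a standard rank-based implementation; the score lists are placeholders, not results from the paper.

```python
# Compare a detector trained on UniDG synthetics against one trained on
# real defects, scored on the same held-out labels.

def auroc(labels, scores):
    """Probability that a random anomalous sample outranks a normal one."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 0, 1, 1, 1]                      # held-out ground truth
scores_real  = [0.1, 0.3, 0.2, 0.8, 0.7, 0.9]    # detector trained on real defects
scores_synth = [0.1, 0.2, 0.7, 0.9, 0.6, 0.8]    # detector trained on UniDG output

print(auroc(labels, scores_real), auroc(labels, scores_synth))
```

Running the same comparison for pixel-level AUROC/AP/PRO would cover the localization half of the claim.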

Figures

Figures reproduced from arXiv: 2604.08915 by Bin-Bin Gao, Chengjie Wang, Jiawei Zhan, Jun Liu, Xiaochen Chen, Yuanting Fan, Yuhuan Lin, Zhewei Dai.

Figure 1
Figure 1: The framework of existing anomaly generation methods. The proposed UniDG maintains straightforward inference and a fully open-sourced framework.
Figure 2
Figure 2: Overview of the UDG Dataset. (a) The construction pipeline for generating normal-abnormal-mask-caption quadruplets. (b) The distribution across scenarios and the frequency of mapped defect categories.
Figure 3
Figure 3: Overall framework of UniDG. UniDG leverages MM-DiT with a Defect-Context Editing strategy to integrate reference defects into target regions. Diversity-SFT and Consistency-RFT further improve synthesis quality.
Figure 5
Figure 5: Qualitative comparisons of cross-object defect generation capabilities on the MVTec-AD dataset.
Figure 6
Figure 6: Comparison of performance between UniDG-SFT and UniDG-RFT using Good-Same-Bad (GSB) evaluation.
Figure 7
Figure 7: The system prompt for the Captioner Agent.
Figure 9
Figure 9: Part visualizations of the abnormal-normal-mask-caption quadruplets of the UDG Dataset.
Figure 10
Figure 10: The quantitative comparison between the proposed UniDG-Text and existing proprietary advanced methods.
Figure 11
Figure 11: The failure cases of UniDG-SFT.
Figure 12
Figure 12: The system prompt for Defect-Und-Reward and MLLM-based comprehensive scores.
Figure 13
Figure 13: Part visualizations of UDG-Reward-Bench.
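For reference, the paper's generator is trained with a rectified-flow objective (see reference [6]); the extraction-garbled formulation spilled near Figure 2 can be reconstructed in standard notation, with clean latent X0 and Gaussian noise X1:

```latex
% Rectified-flow / conditional-flow dynamics, reconstructed from the
% garbled source text; symbols follow the usual flow-matching convention.
\frac{\mathrm{d}}{\mathrm{d}t} X_t = v(X_t, t) = X_1 - X_0,
\qquad \forall t \in [0, 1],
% with X_0 \sim p_{\mathrm{data}},\; X_1 \sim \mathcal{N}(0, I),
% and the linear interpolation
X_t = (1 - t)\, X_0 + t\, X_1 .
```

The constant velocity X1 − X0 is exactly the time derivative of the linear interpolation, which is what makes the two lines term-by-term consistent.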
Original abstract

Existing defect/anomaly generation methods often rely on few-shot learning, which overfits to specific defect categories due to the lack of large-scale paired defect editing data. This issue is aggravated by substantial variations in defect scale and morphology, resulting in limited generalization, degraded realism, and category consistency. We address these challenges by introducing UDG, a large-scale dataset of 300K normal-abnormal-mask-caption quadruplets spanning diverse domains, and by presenting UniDG, a universal defect generation foundation model that supports both reference-based defect generation and text instruction-based defect editing without per-category fine-tuning. UniDG performs Defect-Context Editing via adaptive defect cropping and structured diptych input format, and fuses reference and target conditions through MM-DiT multimodal attention. A two-stage training strategy, Diversity-SFT followed by Consistency-RFT, further improves diversity while enhancing realism and reference consistency. Extensive experiments on MVTec-AD and VisA show that UniDG outperforms prior few-shot anomaly generation and image insertion/editing baselines in synthesis quality and downstream single- and multi-class anomaly detection/localization. Code will be available at https://github.com/RetoFan233/UniDG.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces UDG, a dataset of 300K normal-abnormal-mask-caption quadruplets across domains, and UniDG, a foundation model for universal defect generation supporting reference-based and text-instruction-based editing without per-category fine-tuning. The method uses Defect-Context Editing with adaptive cropping and diptych inputs, MM-DiT multimodal attention for condition fusion, and a two-stage training schedule (Diversity-SFT followed by Consistency-RFT). Experiments claim superior synthesis quality and improved single- and multi-class anomaly detection/localization performance over few-shot baselines on MVTec-AD and VisA.

Significance. If the empirical results hold under rigorous verification, this work could meaningfully advance industrial anomaly detection by providing scalable, generalizable synthetic defect data that reduces overfitting issues in few-shot approaches. The large-scale dataset and foundation-model framing with explicit two-stage training for diversity and consistency are notable strengths that could influence future data-generation pipelines.

major comments (2)
  1. [§5] §5 (Experiments): The reported outperformance on MVTec-AD and VisA for both synthesis quality and downstream detection/localization lacks accompanying details on run counts, standard deviations, or statistical significance tests. Without these, it is difficult to assess whether the gains are robust or could be explained by evaluation variance or implementation differences in baselines.
  2. [§4.2] §4.2 (MM-DiT fusion): The multimodal attention mechanism for fusing reference and target conditions is load-bearing for the claimed generalization, yet the manuscript provides insufficient architectural specifics (e.g., exact attention masking, conditioning injection points, or parameter counts) to allow reproduction or to verify that the fusion indeed preserves category consistency across scales and morphologies.
minor comments (2)
  1. [Abstract] The abstract states that code will be available at a GitHub link, but the manuscript should explicitly confirm dataset release details (e.g., access method, licensing) to support the large-scale claim.
  2. [§3] Figure captions and §3 (dataset construction) would benefit from additional quantitative statistics on defect scale/morphology distributions to substantiate the diversity assertions.
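A minimal version of what major comment 1 asks for: per-seed means and standard deviations plus a paired t-statistic against the strongest baseline. All numbers below are illustrative placeholders, not results from the paper.

```python
# Multi-seed reporting sketch: mean +/- std over seeds and a paired
# t-test statistic (df = n - 1) on per-seed metric differences.
from statistics import mean, stdev

unidg    = [97.1, 96.8, 97.4]   # hypothetical image AUROC over three seeds
baseline = [96.2, 96.5, 96.0]   # strongest baseline, same seeds/splits

diffs = [u - b for u, b in zip(unidg, baseline)]
d_bar, d_sd = mean(diffs), stdev(diffs)
n = len(diffs)
t_stat = d_bar / (d_sd / n ** 0.5)   # compare against t-distribution, df = n - 1

print(f"UniDG {mean(unidg):.2f}±{stdev(unidg):.2f}, "
      f"baseline {mean(baseline):.2f}±{stdev(baseline):.2f}, t={t_stat:.2f}")
```

With only three seeds a Wilcoxon signed-rank test has almost no power, so the paired t-test (or simply reporting all per-seed numbers) is the more informative choice here.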

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation, the recommendation for minor revision, and the constructive comments on experimental rigor and architectural reproducibility. We address each major comment below and will incorporate the suggested improvements in the revised manuscript.

read point-by-point responses
  1. Referee: [§5] §5 (Experiments): The reported outperformance on MVTec-AD and VisA for both synthesis quality and downstream detection/localization lacks accompanying details on run counts, standard deviations, or statistical significance tests. Without these, it is difficult to assess whether the gains are robust or could be explained by evaluation variance or implementation differences in baselines.

    Authors: We agree that additional statistical details would strengthen the empirical claims. The reported results were obtained from single training and evaluation runs due to the substantial computational cost of training the foundation model. In the revised manuscript we will re-run the key experiments on MVTec-AD and VisA using three independent random seeds, report mean and standard deviation for all synthesis-quality and anomaly-detection metrics, and include paired t-tests (or Wilcoxon tests where appropriate) against the strongest baselines to establish statistical significance. revision: yes

  2. Referee: [§4.2] §4.2 (MM-DiT fusion): The multimodal attention mechanism for fusing reference and target conditions is load-bearing for the claimed generalization, yet the manuscript provides insufficient architectural specifics (e.g., exact attention masking, conditioning injection points, or parameter counts) to allow reproduction or to verify that the fusion indeed preserves category consistency across scales and morphologies.

    Authors: We acknowledge that the current description of MM-DiT is high-level. In the revision we will expand Section 4.2 with: (i) the precise attention-masking pattern used to prevent cross-contamination between reference and target tokens, (ii) the exact DiT block indices and layer types where reference and text embeddings are injected, (iii) the parameter counts of the multimodal attention modules, and (iv) a short ablation showing that the chosen fusion strategy maintains category consistency across defect scales. We will also release the full model configuration file together with the code. revision: yes
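The attention-masking pattern promised in (i) can be sketched under one plausible reading of the diptych setup: a mask set M marks reference-defect tokens (left half) and target-region tokens (right half), and an additive bias B keeps tokens outside M from attending into M. The semantics, names, and bias value here are assumptions, not the paper's specification.

```python
# Hedged sketch of an additive attention-bias matrix B (added to
# QK^T / sqrt(d_k) before softmax). Entries of -1e9 act as -inf.
NEG = -1e9

def attention_bias(n_tokens, m_indices):
    """B[q][k] = NEG where query q (outside M) would read a token in M."""
    m = set(m_indices)
    B = [[0.0] * n_tokens for _ in range(n_tokens)]
    for q in range(n_tokens):
        for k in range(n_tokens):
            # block background queries from copying defect/target features
            if q not in m and k in m:
                B[q][k] = NEG
    return B

B = attention_bias(6, m_indices=[1, 2, 4])
print(B[0][1], B[1][2])   # blocked: -1e9, allowed: 0.0
```

Releasing exactly this kind of matrix-construction code alongside the configuration file would settle the reproducibility concern.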

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper's core contribution is a new 300K quadruplet dataset (UDG) and a foundation model (UniDG) trained via Defect-Context Editing, MM-DiT fusion, and a two-stage Diversity-SFT + Consistency-RFT schedule. All performance claims are supported by direct empirical comparisons on external, standard benchmarks (MVTec-AD, VisA) against prior few-shot baselines, with no load-bearing equations, fitted parameters renamed as predictions, or self-citation chains that reduce the central result to its own inputs by construction. The derivation chain remains self-contained and externally falsifiable.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

Only the abstract is available, so the ledger is necessarily high-level and incomplete. The central claim rests on the effectiveness of the new dataset construction and the described architectural/training choices for universal generalization.

free parameters (1)
  • training hyperparameters for Diversity-SFT and Consistency-RFT stages
    Two-stage fine-tuning process implies multiple tunable parameters whose specific values are not provided in the abstract.
axioms (2)
  • domain assumption Defect-Context Editing via adaptive cropping and diptych format preserves sufficient context for realistic and consistent defect insertion
    Invoked as the core mechanism enabling reference-based editing without per-category adaptation.
  • domain assumption MM-DiT multimodal attention effectively fuses reference defect and target image conditions
    Assumed to enable the universal (non-fine-tuned) behavior described.

pith-pipeline@v0.9.0 · 5524 in / 1453 out tokens · 47630 ms · 2026-05-10T17:45:04.293510+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

18 extracted references · 10 canonical work pages · 7 internal anchors

  1. [1]

    Qwen3-VL Technical Report

    Bai, S., Cai, Y., Chen, R., et al. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631.

  2. [2]

    Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

    Cai, H., Cao, S., Du, R., Gao, P., Hoi, S., Huang, S., Hou, Z., Jiang, D., Jin, X., Li, L., et al. Z-Image: An efficient image generation foundation model with single-stream diffusion transformer. arXiv preprint arXiv:2511.22699.

  3. [3]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261.

  4. [4]

    An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

    Gal, R., Alaluf, Y., Atzmon, Y., Patashnik, O., Bermano, A. H., Chechik, G., and Cohen-Or, D. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618.

  5. [5]

    Freeedit: Mask-Free Reference-Based Image Editing with Multi-Modal Instruction

    He, R., Ma, K., Huang, L., Huang, S., Gao, J., Wei, X., Dai, J., Han, J., and Liu, S. Freeedit: Mask-free reference-based image editing with multi-modal instruction. arXiv preprint arXiv:2409.18071.

  6. [6]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Liu, X., Gong, C., and Liu, Q. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003.

  7. [7]

    Prodigy: An Expeditiously Adaptive Parameter-Free Learner

    Mishchenko, K. and Defazio, A. Prodigy: An expeditiously adaptive parameter-free learner. arXiv preprint arXiv:2306.06101.

  8. [8]

    SAM 2: Segment Anything in Images and Videos

    Ravi, N., Gabeur, V., Hu, Y.-T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., et al. SAM 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714.

  9. [9]

    Defectfill: Realistic defect generation with inpainting diffusion model for visual inspection

    Song, J., Park, D., Baek, K., Lee, S., Choi, J., Kim, E., and Yoon, S. Defectfill: Realistic defect generation with inpainting diffusion model for visual inspection. In CVPR, 2025a. Song, W., Jiang, H., Yang, Z., Quan, R., and Yang, Y. Insert anything: Image insertion via in-context editing in DiT. In AAAI, 2025b. Tan, X., Liu, J., Fan, Y., Gao, B.-B., J...

  10. [10]

    IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

    Ye, H., Zhang, J., Liu, S., Han, X., and Yang, W. IP-Adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721.

  11. [11]

    For abnormal images, we ensure that the randomly sampled masks do not overlap with the annotated abnormal regions, and otherwise apply the same data construction procedure

    to extract foreground regions and apply a pre-constructed mask template library to sample diverse masks (1–3 masks) for training. For abnormal images, we ensure that the randomly sampled masks do not overlap with the annotated abnormal regions, and otherwise apply the same data construction procedure. We follow the same training setup as UniDG (optimizer,...

  12. [12]

    The VAE encoder compresses the dimensionality of MM-DiT latent features to reduce computational cost, following the design of LDM (Rombach et al., 2022)

    MM-DiT. The VAE encoder compresses the dimensionality of MM-DiT latent features to reduce computational cost, following the design of LDM (Rombach et al., 2022). The CLIP and T5-XXL text encoders are used to encode text instructions; they are not activated for reference-based defect synthesis, and are only required for the instruction-based defect synthe...

  13. [13]

    Analysis

    Overall Quality (Score: 0.0–5.0). Evaluate the holistic visual quality of the final target image. Consider: **Realism**: Does the final image look like a genuine photograph? **Visual Fidelity**: Is the image sharp, clear, and free from artifacts? **Professional Appearance**: Would this result be suitable for industrial quality assessment? **Composite Score...

  14. [14]

    Defect-Und-Reward achieves performance comparable to proprietary models on UDG-Reward-Bench, demonstrating its effectiveness and robustness

    An advanced MLLM (i.e., Gemini-3-Pro (Comanici et al., 2025)) then summarizes the annotations for each pair by discarding the highest and lowest scores. Defect-Und-Reward achieves performance comparable to proprietary models on UDG-Reward-Bench, demonstrating its effectiveness and robustness. Defect-Recog-Reward. Defect-Recog-Reward trains a universal defe...

  15. [15]

    Higher scores are assigned when the predicted masks better match the ground-truth masks

    The instance segmentation model is built on ViT-B/16 and predicts defect instance masks for input images. Higher scores are assigned when the predicted masks better match the ground-truth masks. The scoring is based on three metrics: pixel AUROC, pixel AP, and pixel PRO. We normalize these metrics, average them with the category score, and use the resulti...

  16. [16]

    We adopt Group Relative Policy Optimization (GRPO) to optimize the generation model using reward signals

    Partially inspired by Consistent-RFT (Tan et al., 2026). We adopt Group Relative Policy Optimization (GRPO) to optimize the generation model using reward signals. For each defect generation task, we sample a group of G generated images {I_tar^(i)}_{i=1}^{G} and compute their rewards {r^(i)}_{i=1}^{G}. Note that Ĩ_ref denotes the reference defect subject features from ...

  17. [17]

    Beyond MLLM-based approaches, these contextual signals may also benefit vision-language alignment-based anomaly detection methods (Gao et al., 2026)

    These quadruplets explicitly provide defect-region cues and aligned textual descriptions, leading to improved performance for Qwen2.5-VL-7B-Instruct, which becomes comparable to Qwen2.5-VL-72B-Instruct without quadruplet context. Beyond MLLM-based approaches, these contextual signals may also benefit vision-language alignment-based anomaly detection metho...

  18. [18]

    It can be noticed that UniDG (reference image-based) is not significantly slower than other FSAG or Image Insertion methods, as it adopts the Rectified Flow... (Table 9: The quadruplets of UDG improve MLLM-based anomaly classification and detection methods.)
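Entry [16]'s GRPO step, sampling a group of G generated images per task and normalizing their rewards within the group, can be sketched as a group-relative advantage computation. The reward values below are placeholders.

```python
# Group-relative advantages as used in GRPO: each sample's reward is
# standardized against its own group's mean and (population) std.
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-8):
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

rewards = [0.2, 0.8, 0.5, 0.5]   # G = 4 samples for one defect-generation task
adv = grpo_advantages(rewards)
print([round(a, 3) for a in adv])   # -> [-1.414, 1.414, 0.0, 0.0]
```

Because advantages are computed within each group, no separate value network is needed; samples are simply pushed toward the better-rewarded generations of their own group.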