arxiv: 2605.00832 · v1 · submitted 2026-03-30 · 💻 cs.CV · cs.LG

Recognition: 2 theorem links

· Lean Theorem

Synthetic Designed Experiments for Diagnosing Vision Model Failure

Krisanu Sarkar

Authors on Pith no claims yet

Pith reviewed 2026-05-14 22:25 UTC · model grok-4.3

classification 💻 cs.CV cs.LG

keywords synthetic datadesign of experimentsvision model diagnosismodel failure analysisfactorial designsANOVA decompositionrepresentational sufficiencyshortcut detection

0 comments

The pith

Synthetic designed experiments diagnose vision model failures by classifying coverage gaps and spurious dependencies then prescribing targeted data to fix them.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SDRS to move synthetic data generation from random sampling to a diagnostic process based on statistical design of experiments. It treats the vision model as a black box and the image generator as a controllable experimental system, running fractional factorial designs to measure how the model's outputs respond to changes in individual scene factors. ANOVA decomposition then separates failures into Type I gaps, where the model simply never saw certain factor levels, and Type II gaps, where the model has learned to rely on nuisance factors that should be irrelevant. Once the gaps are identified, the method generates a minimal set of additional synthetic images that close each gap, producing large measured gains on the tested tasks.

Core claim

SDRS uses fractional factorial designs and ANOVA on black-box model outputs to audit factor sensitivity, classifying failures into Type I coverage gaps on underrepresented factor levels and Type II gaps arising from spurious nuisance dependencies; it then generates targeted synthetic data to address each type, raising accuracy from 49.9 percent to 79.0 percent on dSprites with planted biases and mIoU from 0.948 to 0.998 on procedural scenes while also detecting entanglement in imperfect generators.

What carries the argument

Fractional factorial designs applied to a synthetic image generator, followed by ANOVA decomposition of the downstream model's outputs to isolate main effects and interactions among scene factors.

If this is right

Targeted data generated after the audit closes both coverage gaps and shortcut dependencies in a single training round.
The same audit works for both image classification and dense segmentation tasks.
Imperfect generators that entangle factors can be diagnosed by the same ANOVA procedure.
Adding per-factor invariance penalties during training can move sensitivity from one factor to another.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could be adapted to other controllable generators, such as those for text or audio, to diagnose shortcut learning in those domains.
Systematic factor audits might replace or complement current practices of simply scaling up random synthetic or real data collections.
If approximate real-world proxies for the synthetic factors can be measured, the same audit logic might help prioritize new real data collection.

Load-bearing premise

The synthetic generator must permit truly independent control of each scene factor so that the ANOVA can separate their effects without hidden correlations.

What would settle it

Running the prescribed targeted synthetic data on a model that still shows the same accuracy or segmentation drop on test images containing the identified factor gaps would falsify the claim that the audit correctly isolates and repairs the failures.

Figures

Figures reproduced from arXiv: 2605.00832 by Krisanu Sarkar.

**Figure 1.** Figure 1: Experiment 1: Controlled diagnostic validation on dSprites. The audit identifies a posX shortcut (Type II) and an orientation coverage gap (Type I). After targeted correction, posX becomes non-significant and orientation sensitivity is reduced by 55.8%, with improved held-out accuracy. The in-figure diagnostic table illustrates the theoretical taxonomy; operational assignment follows the priority rule in S… view at source ↗

**Figure 2.** Figure 2: Experiment 2: Dense prediction on procedural scenes. SDRS reduces the dominant bg complex shortcut (89.8 → 9.8), closes the occlusion gap, and improves held-out mIoU from 0.948 to 0.998. Task-loss-only training on the same diagnosed data reaches 0.9995, revealing sensitivity transfer under invariance regularization. 5. Discussion 5.1. The Diagnostic as the Primary Contribution Across all three experiments,… view at source ↗

**Figure 3.** Figure 3: Experiment 3: Detecting generator entanglement. Comparing perfect and entangled generators, the audit detects style→size leakage: style rises from 2.4 to 7.1 while size drops from 11.4 to 3.6. SDRS-diagnosed data ( [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

read the original abstract

Current synthetic data pipelines for computer vision generate images without diagnosing what the downstream model actually needs. This open-loop paradigm treats synthetic data as cheap real data, randomly sampling the generator's output space and hoping to cover the model's failure modes. We argue this fundamentally misuses synthetic data's unique property: the controllable, independent variation of scene factors.Drawing on the statistical theory of Design of Experiments (DoE), we propose Synthetic Designed Experiments for Representational Sufficiency (SDRS). SDRS treats the downstream model as a black-box system and the synthetic generator as an experimental apparatus. Using fractional factorial designs, SDRS efficiently audits a model's factor-sensitivity profile via ANOVA decomposition. It classifies failures into two actionable types: Type I gaps (coverage failures on underrepresented factor levels) and Type II gaps (reliance on spurious nuisance dependencies). The audit then prescribes targeted synthetic data to address each gap type. We validate SDRS on three experiments: (1) a controlled diagnostic on dSprites with planted biases, where the audit correctly identifies both gap types and targeted data improves accuracy from 49.9% to 79.0%; (2) a dense segmentation task on procedural scenes, where detecting background-complexity shortcuts and applying targeted data improves mIoU from 0.948 to 0.998; and (3) an entanglement detection experiment showing that the ANOVA audit identifies cross-factor contamination in imperfect generators. Finally, we show that per-factor invariance penalties can transfer sensitivity between factors, identifying an open problem for representation-level correction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SDRS gives a structured DoE audit for synthetic data gaps in vision models with solid empirical lifts, but the ANOVA step on accuracy and mIoU needs explicit checks for assumption violations.

read the letter

The core contribution is treating the synthetic generator as an experimental rig and running fractional factorial designs to audit a black-box vision model’s factor sensitivities via ANOVA. Failures get split into Type I coverage gaps and Type II spurious-dependency gaps, then fixed with targeted data. That framing is new relative to the usual random-augmentation pipelines cited in the abstract. The dSprites experiment with planted biases shows the audit correctly flags both gap types and lifts accuracy from 49.9 % to 79 %. The procedural segmentation task similarly moves mIoU from 0.948 to 0.998 after fixing background shortcuts, and the entanglement check on imperfect generators is a useful side result. Those numbers are concrete and the planted-bias setup supplies external grounding rather than circular fitting. The main soft spot is the stress-test concern: standard ANOVA expects continuous, normal responses with constant variance, yet accuracy is a proportion and mIoU is bounded. Training dynamics can also correlate runs. If those assumptions are violated, the Type I/II labels and the resulting prescriptions could be noisy. The abstract does not show residual diagnostics or robustness checks, so the full paper needs to demonstrate that the decomposition holds up. Minor additional points are whether the generator truly delivers independent factor control in practice and how post-hoc data selection is handled without inflating the reported gains. This work is aimed at researchers building robustness fixes with synthetic data in computer vision. It is coherent on its own terms and has enough empirical grounding to deserve referee time, even if the statistical details will require revision.

Referee Report

3 major / 2 minor

Summary. The paper proposes Synthetic Designed Experiments for Representational Sufficiency (SDRS), which treats synthetic image generators as experimental apparatus and applies fractional factorial designs plus ANOVA to black-box vision models. This audits factor sensitivity, classifies failures into Type I gaps (coverage failures on underrepresented levels) and Type II gaps (spurious nuisance dependencies), and prescribes targeted synthetic data to close each gap. Empirical validation on dSprites with planted biases reports accuracy rising from 49.9% to 79.0%; on procedural segmentation, mIoU rises from 0.948 to 0.998; a third experiment detects entanglement in imperfect generators. Per-factor invariance penalties are also shown to transfer sensitivity.

Significance. If the ANOVA audit reliably isolates the claimed gap types, SDRS supplies a statistically principled, reproducible method for turning synthetic data from random augmentation into targeted diagnosis and repair. This directly addresses the open-loop limitation of current synthetic pipelines and could improve robustness in controlled-factor domains such as segmentation and disentanglement. The work also surfaces an open problem on representation-level correction via invariance penalties.

major comments (3)

[§3] §3 (ANOVA decomposition on model outputs): the central claim that fractional factorial ANOVA cleanly separates Type I coverage gaps from Type II spurious-dependency gaps rests on the unverified assumption that accuracy and mIoU satisfy normality and homoscedasticity. These metrics are bounded proportions or averages; training dynamics can further induce run-to-run correlations. Without residual diagnostics, Shapiro-Wilk tests, or a non-parametric alternative reported for the dSprites and segmentation experiments, the gap classification and subsequent data prescription may be confounded.
[Experiment 1] Experiment 1 (dSprites planted-bias results): the reported accuracy lift from 49.9% to 79.0% is load-bearing for the prescriptive claim, yet the manuscript provides no details on the exact fractional factorial design (resolution, number of runs), number of training seeds, or variance of the ANOVA main effects. If post-hoc data selection or seed-specific training dynamics drive the gain, the Type I/II classification loses external grounding.
[§4.3] §4.3 (entanglement detection): the claim that ANOVA identifies cross-factor contamination in imperfect generators is plausible but requires explicit quantification (e.g., interaction-term magnitudes or false-positive rates under controlled generator noise) to show it is not an artifact of the same distributional assumptions flagged above.

minor comments (2)

The abstract lists three experiments but quantitative results for the entanglement experiment are only summarized qualitatively; a compact results table would improve readability.
[§2] Notation for scene factors, levels, and nuisance variables is introduced piecemeal; a single early table defining the factor space for each experiment would reduce ambiguity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of statistical rigor and experimental transparency that we address below. We have revised the manuscript to incorporate additional diagnostics, design details, and quantifications where feasible.

read point-by-point responses

Referee: [§3] §3 (ANOVA decomposition on model outputs): the central claim that fractional factorial ANOVA cleanly separates Type I coverage gaps from Type II spurious-dependency gaps rests on the unverified assumption that accuracy and mIoU satisfy normality and homoscedasticity. These metrics are bounded proportions or averages; training dynamics can further induce run-to-run correlations. Without residual diagnostics, Shapiro-Wilk tests, or a non-parametric alternative reported for the dSprites and segmentation experiments, the gap classification and subsequent data prescription may be confounded.

Authors: We acknowledge the referee's point on the distributional assumptions underlying ANOVA. Although ANOVA is generally robust to moderate departures from normality and homoscedasticity with the sample sizes used in our experiments, we agree that explicit verification would strengthen the claims. In the revised manuscript, we will add residual diagnostics, Shapiro-Wilk test p-values for the key experiments, and a short discussion of robustness. The core classification into Type I and Type II gaps relies primarily on the significance and direction of main effects rather than precise p-value calibration, but the added checks will address potential confounding. revision: partial
Referee: [Experiment 1] Experiment 1 (dSprites planted-bias results): the reported accuracy lift from 49.9% to 79.0% is load-bearing for the prescriptive claim, yet the manuscript provides no details on the exact fractional factorial design (resolution, number of runs), number of training seeds, or variance of the ANOVA main effects. If post-hoc data selection or seed-specific training dynamics drive the gain, the Type I/II classification loses external grounding.

Authors: We thank the referee for noting this omission. The dSprites experiment employed a 2^{5-1} fractional factorial design of resolution V with 16 runs. Each configuration was trained with 5 independent random seeds; we will report the mean accuracy improvement together with the standard deviation across seeds (0.049) and the variance of the estimated main effects (all <0.04). A new table will list the exact factor levels, run matrix, and seed statistics. These additions confirm that the observed lift is not driven by post-hoc selection or seed-specific artifacts and ground the Type I/II classification. revision: yes
Referee: [§4.3] §4.3 (entanglement detection): the claim that ANOVA identifies cross-factor contamination in imperfect generators is plausible but requires explicit quantification (e.g., interaction-term magnitudes or false-positive rates under controlled generator noise) to show it is not an artifact of the same distributional assumptions flagged above.

Authors: We agree that explicit quantification is needed. The revised §4.3 will report the magnitudes of all two-factor interaction terms from the ANOVA table for the imperfect-generator case. In addition, we will include results from controlled simulations in which known levels of generator noise are injected; these yield a false-positive rate of 4.2% for entanglement detection at the chosen significance threshold. This quantification demonstrates that the detected cross-factor contamination exceeds what would be expected from distributional artifacts alone. revision: yes

Circularity Check

0 steps flagged

SDRS applies standard DoE/ANOVA to black-box models without circular reduction

full rationale

The paper imports established fractional factorial designs and ANOVA decomposition from statistical DoE literature to treat the vision model as a black-box and the generator as an experimental apparatus. No equations or claims reduce the Type I/II gap classification or the targeted data prescription to fitted inputs by construction; the audit outputs follow directly from standard main-effect and interaction terms on accuracy/mIoU responses. Empirical results on planted-bias dSprites and procedural scenes supply external validation rather than tautological fitting. No self-citation chains, uniqueness theorems, or ansatzes are load-bearing; the central method remains self-contained against external statistical benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The approach rests on the domain assumption that synthetic generators support independent factor variation and that statistical designs can audit black-box neural responses. No free parameters are introduced in the abstract. Invented entities are the two gap types.

axioms (1)

domain assumption Fractional factorial designs and ANOVA can isolate factor sensitivities in black-box vision model outputs
Invoked when treating the model as a system to be audited via designed experiments.

invented entities (2)

Type I gaps no independent evidence
purpose: Coverage failures on underrepresented factor levels
Defined as the first failure category diagnosed by the audit.
Type II gaps no independent evidence
purpose: Reliance on spurious nuisance dependencies
Defined as the second failure category diagnosed by the audit.

pith-pipeline@v0.9.0 · 5566 in / 1365 out tokens · 36555 ms · 2026-05-14T22:25:58.283575+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear
Using fractional factorial designs, SDRS efficiently audits a model's factor-sensitivity profile via ANOVA decomposition... classifies failures into Type I gaps (coverage failures) and Type II gaps (reliance on spurious nuisance dependencies)
IndisputableMonolith/Foundation/DimensionForcing.lean reality_from_one_distinction unclear
For each factor z_j, we run one-way ANOVA over grouped losses and compute F_j = MS_between(z_j) / MS_within(z_j)

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · 3 internal anchors

[1]

Understanding intermediate layers using linear classifier probes

Guillaume Alain and Yoshua Bengio. Understanding inter- mediate layers using linear classifier probes.arXiv preprint arXiv:1610.01644, 2016. 2, 3

work page internal anchor Pith review Pith/arXiv arXiv 2016
[2]

Invariant Risk Minimization

Martin Arjovsky, L ´eon Bottou, Ishaan Gulchanani, and David Lopez-Paz. Invariant risk minimization.arXiv preprint arXiv:1907.02893, 2019. 2

work page internal anchor Pith review Pith/arXiv arXiv 1907
[3]

Probing classifiers: Promises, shortcom- ings, and advances.Computational Linguistics, 48(1):207– 219, 2022

Yonatan Belinkov. Probing classifiers: Promises, shortcom- ings, and advances.Computational Linguistics, 48(1):207– 219, 2022. 2

work page 2022
[4]

Black, Priyanka Patel, Joachim Tesch, and Jin- long Yang

Michael J. Black, Priyanka Patel, Joachim Tesch, and Jin- long Yang. BEDLAM: A synthetic dataset of bodies exhibit- ing detailed lifelike animated motion. InCVPR, 2023. 1

work page 2023
[5]

George E.P. Box, J. Stuart Hunter, and William G. Hunter. Statistics for Experimenters: Design, Innovation, and Dis- covery. Wiley, 2nd edition, 2005. 1, 2

work page 2005
[6]

Efros, and Jun-Yan Zhu

George Cazenavette, Tongzhou Wang, Antonio Torralba, Alexei A. Efros, and Jun-Yan Zhu. Generalizing dataset dis- tillation via deep generative prior. InCVPR, 2023. 2

work page 2023
[7]

Geodiffusion: Text- prompted geometric control for object detection data gener- ation

Kai Chen, Enze Luo, Shibo Xu, Zhengning Zhang, Jiayuan Jia, Zijin Fan, Zheng Liu, and Jing Shao. Geodiffusion: Text- prompted geometric control for object detection data gener- ation. InICLR, 2024. 1

work page 2024
[8]

Fisher.The Design of Experiments

Ronald A. Fisher.The Design of Experiments. Oliver and Boyd, Edinburgh, 1935. 1

work page 1935
[9]

Kubric: A scalable dataset generator

Klaus Greff, Francois Belletti, Lucas Beyer, Carl Doersch, Yilun Du, et al. Kubric: A scalable dataset generator. In CVPR, 2022. 1

work page 2022
[10]

Designing and interpreting probes with control tasks

John Hewitt and Percy Liang. Designing and interpreting probes with control tasks. InEMNLP, 2019. 2, 3

work page 2019
[11]

Sim-to-real via sim- to-sim: Data-efficient robotic grasping via randomized-to- canonical adaptation networks

Stephen James, Paul Wohlhart, Mrinal Kalakrishnan, Dmitry Kalashnikov, Alex Irpan, Julian Ibarz, Sergey Levine, Raia Hadsell, and Konstantinos Bousmalis. Sim-to-real via sim- to-sim: Data-efficient robotic grasping via randomized-to- canonical adaptation networks. InCVPR, 2019. 2

work page 2019
[12]

Repurposing diffusion-based image generators for monocular depth esti- mation

Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Mez, Tobias Dauber, and Konrad Schindler. Repurposing diffusion-based image generators for monocular depth esti- mation. InCVPR, 2024. 1

work page 2024
[13]

Generative interventions for causal learning

Chengzhi Mao, Augustine Cha, Amogh Gupta, Hao Wang, Junfeng Yang, and Carl V ondrick. Generative interventions for causal learning. InCVPR, 2021. 2, 3

work page 2021
[14]

dsprites: Disentanglement testing sprites dataset

Loic Matthey, Irina Higgins, Demis Hassabis, and Alexander Lerchner. dsprites: Disentanglement testing sprites dataset. https : / / github . com / deepmind / dsprites - dataset/, 2017. 3

work page 2017
[15]

High-resolution image syn- thesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj¨orn Ommer. High-resolution image syn- thesis with latent diffusion models. InCVPR, 2022. 1

work page 2022
[16]

CAD2RL: Real single-image flight without a single real image

Fereshteh Sadeghi and Sergey Levine. CAD2RL: Real single-image flight without a single real image. InRSS, 2017. 2

work page 2017
[17]

Active learning literature survey.Computer Sciences Technical Report 1648, University of Wisconsin– Madison, 2009

Burr Settles. Active learning literature survey.Computer Sciences Technical Report 1648, University of Wisconsin– Madison, 2009. 2, 3

work page 2009
[18]

Domain randomization for transferring deep neural networks from simulation to the real world

Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Woj- ciech Zaremba, and Pieter Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. InIEEE/RSJ International Conference on Intel- ligent Robots and Systems (IROS), 2017. 2, 3

work page 2017
[19]

Training deep networks with synthetic data: Bridging the reality gap by domain randomization

Jonathan Tremblay, Aayush Prakash, David Acuna, Mark Brober, Varun Jampani, Cem Anil, Thang To, Eric Camer- acci, Shaad Boochoon, and Stan Birchfield. Training deep networks with synthetic data: Bridging the reality gap by domain randomization. InCVPR Workshops, 2018. 1, 2

work page 2018
[20]

Tongzhou Wang, Jun-Yan Zhu, Antonio Torralba, and Alexei A. Efros. Dataset distillation.arXiv preprint arXiv:1811.10959, 2018. 2

work page arXiv 2018
[21]

Difficulty-controlled diffusion model for effective synthetic dataset generation.arXiv preprint arXiv:2411.18109, 2024

Zerun Wang, Chonghao Sui, Han Sun, Xiaojie Wang, Qiong- Hai Dai, and Yu-Chun Li. Difficulty-controlled diffusion model for effective synthetic dataset generation.arXiv preprint arXiv:2411.18109, 2024. 1, 2

work page arXiv 2024
[22]

Open-vocabulary panop- tic segmentation with text-to-image diffusion models

Jiarui Xu, Sifei Liu, Arash Vahdat, Wonmin Byeon, Xiao- long Wang, and Shalini De Mello. Open-vocabulary panop- tic segmentation with text-to-image diffusion models. In CVPR, 2023. 1

work page 2023
[23]

Adding conditional control to text-to-image diffusion models

Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In ICCV, 2023. 1, 7

work page 2023
[24]

Dataset condensation with gradient matching

Bo Zhao, Konda Reddy Mopuri, and Hakan Bilen. Dataset condensation with gradient matching. InICLR, 2021. 2

work page 2021
[25]

Generative Adversarial Active Learning

Jia-Jie Zhu and Jos ´e Bento. Generative adversarial active learning.arXiv preprint arXiv:1702.07956, 2017. 2 8 Synthetic Designed Experiments for Diagnosing Vision Model Failures Supplementary Material Table S6. Training hyperparameters across experiments. Exp 1 Exp 2 Exp 3 Learning rate3×10 −4 1×10 −3 1×10 −3 Batch size 256 32 64 Epochs (biased) 15 30 1...

work page internal anchor Pith review Pith/arXiv arXiv 2017