pith. machine review for the scientific record.

arxiv: 2605.09425 · v1 · submitted 2026-05-10 · 💻 cs.CV · cs.AI

Recognition: no theorem link

AtteConDA: Attention-Based Conflict Suppression in Multi-Condition Diffusion Models and Synthetic Data Augmentation

Shogo Noguchi

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 02:34 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords multi-condition diffusion · conflict suppression · attention mechanism · image generation · semantic segmentation · depth maps · edge conditions · driving scene augmentation

The pith

An attention mechanism in multi-condition diffusion models suppresses conflicts between segmentation, depth, and edge inputs to preserve more scene structure in generated driving images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to solve a practical problem in using diffusion models for data augmentation in autonomous driving: when multiple conditions like semantic segmentation, depth, and edges are applied together, they often conflict and erode the high-level layout of the original scene. The authors input all three extracted signals into a diffusion model and add an attention-based module that identifies and dampens the conflicting parts of those signals during generation. This produces images that stay closer to the original structural details while still varying in appearance. The result is synthetic training data that retains annotations and can be fed into high-level driving models for tasks such as behavior understanding. The work also supplies a generation pipeline and evaluation protocol so later methods can be compared directly on the same driving-scene criteria.
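
To make that flow concrete, the sketch below mimics the condition-extraction step with toy stand-ins (the paper's figures name a learned segmenter, Metric3Dv2 depth, and Canny edges as the real extractors; every function here is an invented placeholder for illustration, not the authors' code).

```python
import numpy as np

# Toy stand-ins for the three structural extractors described above.
# None of this is the authors' pipeline; it only shows the shape of the data
# the multi-condition generator would consume alongside a text prompt.

def toy_segmentation(img: np.ndarray) -> np.ndarray:
    # Stand-in for a learned segmenter: quantize luminance into coarse "classes".
    return np.digitize(img.mean(axis=-1), bins=[64, 128, 192])

def toy_depth(img: np.ndarray) -> np.ndarray:
    # Stand-in for monocular depth: a "closer at the bottom of the frame" ramp.
    h, w, _ = img.shape
    return np.tile(np.linspace(1.0, 0.1, h)[:, None], (1, w))

def toy_edges(img: np.ndarray) -> np.ndarray:
    # Stand-in for Canny: thresholded finite-difference gradient magnitude.
    gray = img.mean(axis=-1)
    gx = np.abs(np.diff(gray, axis=1, append=gray[:, -1:]))
    gy = np.abs(np.diff(gray, axis=0, append=gray[-1:, :]))
    return ((gx + gy) > 20.0).astype(np.float32)

def extract_conditions(img: np.ndarray) -> dict:
    """The three structural signals that pin scene layout during generation."""
    return {"segmentation": toy_segmentation(img),
            "depth": toy_depth(img),
            "edges": toy_edges(img)}

if __name__ == "__main__":
    img = (np.random.rand(256, 512, 3) * 255.0).astype(np.float32)
    print({k: v.shape for k, v in extract_conditions(img).items()})
```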

Core claim

Inputting semantic segmentation, depth, and edge maps extracted from an original driving image into a diffusion model, combined with an attention-based conflict suppression step, produces generated images that retain stronger high-level structural cues than single-condition or unadjusted multi-condition baselines.

What carries the argument

Attention-based conflict suppression that detects inconsistent signals across the segmentation, depth, and edge conditioning inputs and reduces their influence during the diffusion denoising process.
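
A minimal sketch of what such a mechanism could look like, assuming a per-location gate over three condition feature maps with a straight-through hard selector, loosely following the Patch-wise Adaptation Module described in Figure 9. The module below is an illustrative reconstruction, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import List

class ConflictSuppressionGate(nn.Module):
    """Illustrative per-location gate over three condition feature maps.

    Scores each condition (edge, depth, segmentation) at every spatial
    position and hard-selects the locally dominant one, while gradients
    flow through the soft scores (straight-through estimator). This mirrors
    the behaviour described for the paper's module, not its exact code.
    """

    def __init__(self, channels: int):
        super().__init__()
        # A lightweight scoring head shared across the three conditions.
        self.score = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, cond_feats: List[torch.Tensor]) -> torch.Tensor:
        # cond_feats: three tensors of shape (B, C, H, W).
        scores = torch.cat([self.score(f) for f in cond_feats], dim=1)   # (B, 3, H, W)
        soft = F.softmax(scores, dim=1)                                  # soft weights
        # Hard one-hot choice of the locally dominant condition...
        hard = F.one_hot(soft.argmax(dim=1), num_classes=len(cond_feats))
        hard = hard.permute(0, 3, 1, 2).to(soft.dtype)                   # (B, 3, H, W)
        # ...kept trainable with the straight-through trick.
        gate = hard + soft - soft.detach()
        stacked = torch.stack(cond_feats, dim=1)                         # (B, 3, C, H, W)
        return (gate.unsqueeze(2) * stacked).sum(dim=1)                  # (B, C, H, W)

if __name__ == "__main__":
    # Example: three 64-channel condition feature maps on a 32x64 grid.
    feats = [torch.randn(2, 64, 32, 64) for _ in range(3)]
    fused = ConflictSuppressionGate(channels=64)(feats)
    print(fused.shape)  # torch.Size([2, 64, 32, 64])
```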

If this is right

  • Generated images keep more detailed structural information, making them suitable for augmenting data used in traffic-rule extraction and driving-behavior models.
  • Annotations from the original images remain usable on the synthetic outputs, allowing direct improvement of recognition performance without relabeling.
  • A standardized generation framework and evaluation protocol now exists for measuring structural fidelity in multi-condition driving-scene generation.
  • Condition conflicts are treated as a solvable modeling problem rather than an inherent limit of multi-condition diffusion.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same attention suppression pattern could be tested on other sets of conflicting conditions, such as pose and depth in human-image generation.
  • If the method scales without extra compute cost, it may lower the amount of real-world driving footage needed to train perception systems.
  • Extending the conflict detector to handle temporal consistency across video frames would be a direct next step for video-based driving augmentation.

Load-bearing premise

That an attention mechanism can reliably detect and suppress conflicts among semantic segmentation, depth, and edge conditions without introducing new artifacts or reducing overall image fidelity in driving scenes.

What would settle it

Quantitative structure-preservation scores and downstream-task performance on the authors' proposed evaluation protocol would settle it: the claim fails if adding the attention-suppression module to a plain multi-condition diffusion baseline yields no statistically significant gain, or introduces new artifacts.
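
For orientation, the snippet below sketches three of the structure metrics named in the figures (semantic mIoU, depth RMSE, masked edge L1) in the common-projection spirit of Figure 10, where original-side outputs serve as pseudo ground truth. Shapes, class counts, and thresholds are assumptions, not the paper's settings.

```python
import numpy as np

def miou(seg_orig: np.ndarray, seg_gen: np.ndarray, num_classes: int) -> float:
    """Mean IoU between projected segmentations (original side = pseudo GT)."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(seg_orig == c, seg_gen == c).sum()
        union = np.logical_or(seg_orig == c, seg_gen == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0

def depth_rmse(depth_orig: np.ndarray, depth_gen: np.ndarray) -> float:
    """Root-mean-square error between projected depth maps (lower is better)."""
    return float(np.sqrt(np.mean((depth_orig - depth_gen) ** 2)))

def masked_edge_l1(edge_orig: np.ndarray, edge_gen: np.ndarray) -> float:
    """L1 error on edge maps, restricted to locations where either map fires."""
    mask = (edge_orig > 0) | (edge_gen > 0)
    return float(np.abs(edge_orig - edge_gen)[mask].mean()) if mask.any() else 0.0

if __name__ == "__main__":
    # Toy example on random maps, just to show the interfaces.
    rng = np.random.default_rng(0)
    seg_o = rng.integers(0, 19, (128, 256))
    seg_g = rng.integers(0, 19, (128, 256))
    d_o, d_g = rng.random((128, 256)), rng.random((128, 256))
    e_o = (rng.random((128, 256)) > 0.9).astype(np.float32)
    e_g = (rng.random((128, 256)) > 0.9).astype(np.float32)
    print(miou(seg_o, seg_g, 19), depth_rmse(d_o, d_g), masked_edge_l1(e_o, e_g))
```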

Figures

Figures reproduced from arXiv: 2605.09425 by Shogo Noguchi.

Figure 1: Conceptual comparison between conventional image-engineering augmentation and generative augmentation. Conventional operations such as mirroring, rotation, noise, and clipping increase sample count but do not create semantic changes in weather or time. Generative augmentation can change appearance conditions such as fog, rain, snow, and night while preserving the original scene structure.
Figure 2: Example of the limitation of semantic-mask-only generation, motivated by DGInStyle-style augmentation [39]. If the semantic mask is the only structural condition, details not explicitly constrained by the mask can become weak, and artifacts can appear around road boundaries, vehicles, and other fine structures.
Figure 3: Condition conflict in multi-condition generation. Depth provides low-frequency spatial and distance constraints, while RGB or edge constraints emphasize higher-frequency texture and contour information. When these conditions are injected together without conflict handling, model guidance can become inconsistent and produce distorted or semantically mismatched results.
Figure 4: Comparison of monocular depth maps. ZoeDepth can saturate in distant regions, whereas Metric3Dv2 preserves smoother distant gradients and clearer road-depth structure in this example. Since this work evaluates whether the generated image preserves the original perspective and relative geometry, stable relative depth is important.
Figure 5: Comparison of Canny and HED-style edge extraction. Canny preserves fine sign contours and thin structures more explicitly in this example, which is useful when the goal is structural preservation for driving scenes.
Figure 6: Overall multi-condition generation pipeline. The input RGB image is converted into weather/time estimates, semantic segmentation, depth, and edges. A prompt is generated to change appearance, and the prompt plus local conditions are passed to the multi-condition generator.
Figure 7: Prompt-generation pipeline. CLIP estimates the source weather and time subgroup. Semantic segmentation supplies object names. Qwen3-VL generates a caption without explicit weather/time words. A style dictionary then adds target weather/time adjectives and decorations to build the final prompt.
Figure 8: Model detail. The feature extractor is initialized from Uni-ControlNet local-control weights and processes three structural conditions. Stable Diffusion, the VAE, and the text encoder are frozen, so training mainly adapts the local control branch.
Figure 9: Patch-wise Adaptation Module. Edge, depth, and semantic segmentation are processed by condition-specific stems. At each local feature-grid position, a tri-attention-style gate scores the three conditions, and a straight-through hard selector chooses the locally effective condition before multi-stage control injection. The tri-attention block is not used to continuously fuse all features.
Figure 10: Common-projection structure evaluation. The original image and generated image are passed through the same structural projectors. Original-side outputs are treated as pseudo ground truth, and generated-side outputs are compared in semantic, depth, edge, and object spaces rather than in raw RGB pixel space. The metrics are semantic segmentation mIoU, depth RMSE, masked edge L1 error, and traffic-object preservation.
Figure 11: Intuition for realism evaluation. Low-realism images may preserve layout but show unnatural texture, color, or lighting. Higher-realism images show more plausible light, shadow, and atmosphere. CLIP-CMMD is smaller for the latter.
Figure 12: Intuition for diversity evaluation. Similar generated image pairs have small distances, while pairs with substantially different weather, road condition, and illumination have larger distances. LPIPS and 1 − MS-SSIM quantify this separation.
Figure 13: Intuition for text-alignment evaluation. The correct prompt and 99 mismatched prompts are ranked by CLIP similarity to the generated image. R-Precision is high when the correct prompt appears near the top.
Figure 14: Qualitative comparison of training with and without Uni-ControlNet initialization. Without pretraining, road structure and vehicle geometry can collapse or hallucinate. With pretraining, the generated image better preserves road shape and object placement.
Figure 15: Scaling behavior of structure-related metrics for Tune models. Depth RMSE and edge L1 are lower-is-better; semantic mIoU and object F1 are higher-is-better. Most improvements are largest between 0 and 30K, become milder from 30K to 60K, and approach saturation near 90K.
Figure 16: Scaling behavior of quality, diversity, and text-alignment metrics for Tune models. CLIP-CMMD is lower-is-better, while R-Precision@5 and LPIPS are higher-is-better. Realism improves until 60K and slightly reverses at 90K.
Figure 17: Qualitative comparison across training steps. Prior-work baselines can show unrealistic style or structural changes, while the Tune models better preserve the original road structure and realism as training proceeds.
Figure 18: Zoomed local artifact that can remain even when training steps are increased. A vehicle in the original can disappear or change into another structure, motivating local condition-conflict suppression.
Figure 19: Qualitative comparison around PAM. The models change the original appearance toward a snowy road condition, but distant structural consistency differs.
Figure 20: Zoomed example of distant structure preservation. Tune60K weakens road continuity and distorts distant trees, while PAM60K preserves the distant structure more consistently.
read the original abstract

Recent conditional image generation methods can improve controllability by generating images that are faithful to conditions such as sketches, human poses, segmentation maps, and depth. By applying these techniques to image augmentation while preserving annotations, generated images can be used as additional training data and can improve recognition performance. However, for high-level driving tasks such as traffic-rule extraction and driving-behavior understanding, simply using annotations as conditions is insufficient. Instead, images must be augmented while preserving the detailed high-level structure of the original scene. One possible solution is to use multiple conditions so that generated images retain diverse structural cues after generation. However, when multiple conditions are used, conflicts among conditions can prevent reliable structure preservation. In this work, we input semantic segmentation, depth, and edges extracted from the original image into a multi-condition image generation model, thereby providing rich structural information as conditions. We further propose a modeling approach for handling conflicts among multiple conditions and show that it enables image generation with stronger structural preservation. We also build a generation framework and evaluation protocol for driving tasks, establishing a basis for comparison with prior and future models. As a result, this work contributes to image generation research by addressing condition conflicts in multi-condition generation and provides an important step toward mitigating data scarcity in high-level autonomous-driving tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces AtteConDA, an attention-based conflict suppression module inserted into a multi-condition diffusion pipeline. It takes semantic segmentation, depth, and edge maps extracted from the same source image as conditions, with the goal of generating augmented images that preserve detailed high-level structure in driving scenes better than naive multi-conditioning. The authors also describe a generation framework and evaluation protocol for driving tasks and position the work as addressing data scarcity for high-level autonomous-driving applications such as traffic-rule extraction.

Significance. If the attention mechanism demonstrably reduces condition conflicts while maintaining image fidelity and annotation consistency, the approach could provide a practical tool for synthetic data augmentation in computer vision for autonomous driving. The framing of conflicts as spatially incompatible signals and the introduction of a dedicated evaluation protocol for driving scenes are constructive steps that could serve as a basis for future comparisons, provided quantitative validation is supplied.

major comments (1)
  1. Abstract: the central claim that the proposed attention-based modeling approach 'enables image generation with stronger structural preservation' is asserted without any quantitative metrics, baseline comparisons, ablation studies, or error analysis. The evaluation protocol for driving tasks is mentioned but not described, leaving the load-bearing empirical support for the claim unaddressed.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the concern regarding the abstract and the supporting empirical evidence below, and we have revised the manuscript to strengthen the presentation of our results.

read point-by-point responses
  1. Referee: Abstract: the central claim that the proposed attention-based modeling approach 'enables image generation with stronger structural preservation' is asserted without any quantitative metrics, baseline comparisons, ablation studies, or error analysis. The evaluation protocol for driving tasks is mentioned but not described, leaving the load-bearing empirical support for the claim unaddressed.

    Authors: We agree that the abstract, as a concise summary, does not itself contain the quantitative details or protocol description, which leaves the central claim insufficiently supported within that section alone. The body of the manuscript reports the relevant experiments, including quantitative metrics for structural preservation (e.g., consistency with input annotations), baseline comparisons against single-condition and naive multi-condition diffusion models, ablations isolating the attention-based conflict suppression component, and error analysis on generated driving scenes. The evaluation protocol is described in the dedicated section on the generation framework and driving-task metrics. To directly address the referee's point, we have revised the abstract to incorporate a brief summary of the key quantitative improvements and a short description of the evaluation protocol, ensuring the empirical support is referenced at the point where the claim is made.

    revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The manuscript introduces an attention-based conflict suppression module as a novel modeling addition to multi-condition diffusion pipelines for driving scene augmentation. No equations, fitted parameters, or predictions are shown that reduce by construction to prior inputs or self-citations. The central claim—that the module enables stronger structural preservation—follows from the explicit construction of the module itself rather than from any re-expression of fitted quantities or load-bearing self-citations. The derivation chain is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the method appears to rest on standard diffusion-model assumptions plus the unstated premise that attention can resolve condition conflicts.

pith-pipeline@v0.9.0 · 5525 in / 1109 out tokens · 50429 ms · 2026-05-12T02:34:30.080762+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

105 extracted references · 105 canonical work pages · 7 internal anchors

  1. [1]

MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation

    Omer Bar-Tal, Hila Manor, Kevin Y. Li, Shai Dekel, Omri Fried, Idan Rubinstein, Michael Elad, and Lior Wolf. Multidiffusion: Fusing diffusion paths for controlled image generation. InProceedings of the International Conference on Machine Learning, 2023

  2. [2]

    Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013

  3. [3]

    ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth

    Shariq Farooq Bhat, Ibraheem Alhashim, and Peter Wonka. Zoedepth: Zero-shot transfer by combining relative and metric depth.arXiv preprint arXiv:2302.12288, 2023

  4. [4]

Demystifying MMD GANs

    Mikołaj Bińkowski, Danica J. Sutherland, Michael Arbel, and Arthur Gretton. Demystifying mmd gans. InInternational Conference on Learning Representations, 2018

  5. [5]

    Tim Brooks, Aleksander Holynski, and Alexei A. Efros. Instructpix2pix: Learning to follow image editing instructions. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

  6. [6]

nuScenes: A Multimodal Dataset for Autonomous Driving

    Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020

  7. [7]

A Computational Approach to Edge Detection

    John Canny. A computational approach to edge detection.IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(6):679–698, 1986

  8. [8]

    End-to-end object detection with transformers

    Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. InProceedings of the European Conference on Computer Vision, 2020

  9. [9]

    Driving by the rules: A benchmark for integrating traffic sign regulations into vectorized hd map

    Xinyuan Chang, Maixuan Xue, Xinran Liu, Zheng Pan, and Xing Wei. Driving by the rules: A benchmark for integrating traffic sign regulations into vectorized hd map. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

  10. [10]

    Encoder-decoder with atrous separable convolution for semantic image segmentation

    Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision, 2018

  11. [11]

Masked-Attention Mask Transformer for Universal Image Segmentation

    Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

  12. [12]

    The cityscapes dataset for semantic urban scene understanding

Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016

  13. [13]

AutoAugment: Learning Augmentation Strategies from Data

    Ekin D. Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V. Le. Au- toaugment: Learning augmentation strategies from data. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019

  14. [14]

RandAugment: Practical Automated Data Augmentation with a Reduced Search Space

    Ekin D. Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V. Le. Randaugment: Practical automated data augmentation with a reduced search space. InAdvances in Neural Information Processing Systems, 2020

  15. [15]

    Talk2car: Taking control of your self-driving car

    Thierry Deruyttere, Simon Vandenhende, Davy Neven, Marc Proesmans, and Luc Van Gool. Talk2car: Taking control of your self-driving car. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, 2019

  16. [16]

    Terrance DeVries and Graham W. Taylor. Improved regularization of convolutional neural networks with cutout.arXiv preprint arXiv:1708.04552, 2017

  17. [17]

    Carla: An open urban driving simulator

    Alexey Dosovitskiy, Germán Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. Carla: An open urban driving simulator. InProceedings of the Conference on Robot Learning, 2017

  18. [18]

    Depth map prediction from a single image using a multi-scale deep network

    David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. InAdvances in Neural Information Processing Systems, 2014

  19. [19]

Silq: Simple Large Language Model Quantization-Aware Training

    Steven K. Esser, Jeffrey L. McKinstry, Deepika Bablani, Rathinakumar Appuswamy, and Dharmendra S. Modha. Silq: Simple large language model quantization-aware training.arXiv preprint arXiv:2507.16933, 2025

  20. [20]

    Are we ready for autonomous driving? the kitti vision benchmark suite

    Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2012

  21. [21]

    Clément Godard, Oisin Mac Aodha, Michael Firman, and Gabriel J. Brostow. Digging into self-supervised monocular depth estimation. InProceedings of the IEEE/CVF International Conference on Computer Vision, 2019

  22. [22]

    Syndiff-ad: Improving semantic segmentation and end-to-end autonomous driving with synthetic data from latent diffusion models.arXiv preprint arXiv:2411.16776, 2024

    Harsh Goel, Sai Shankar Narasimhan, Oguzhan Akcin, and Sandeep Chinchali. Syndiff-ad: Improving semantic segmentation and end-to-end autonomous driving with synthetic data from latent diffusion models.arXiv preprint arXiv:2411.16776, 2024

  23. [23]

    Generative adversarial nets

    Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. InAdvances in Neural Information Processing Systems, 2014

  24. [24]

A Kernel Two-Sample Test

    Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, volume 13, pages 723–773, 2012

  25. [25]

    Visual traffic knowledge graph generation from scene images

    Yunfei Guo, Fei Yin, Xiao hui Li, Xudong Yan, Tao Xue, Shuqi Mei, and Cheng-Lin Liu. Visual traffic knowledge graph generation from scene images. InProceedings of the IEEE/CVF International Conference on Computer Vision, 2023

  26. [26]

    Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016

  27. [27]

AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty

    Dan Hendrycks, Norman Mu, Ekin D. Cubuk, Barret Zoph, Justin Gilmer, and Balaji Lakshminarayanan. Augmix: A simple data processing method to improve robustness and uncertainty. InInternational Conference on Learning Representations, 2020

  28. [28]

    Prompt-to-prompt image editing with cross attention control.International Conference on Learning Representations, 2023

    Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control.International Conference on Learning Representations, 2023

  29. [29]

    Clipscore: A reference-free evaluation metric for image captioning

    Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021

  30. [30]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. InAdvances in Neural Information Processing Systems, 2017

  31. [31]

    Denoising diffusion probabilistic models.Advances in Neural Information Processing Systems, 2020

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in Neural Information Processing Systems, 2020

  32. [32]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022

  33. [33]

    Mu Hu, Wei Yin, Chi Zhang, Zhipeng Cai, Xiaoxiao Long, Kaixuan Wang, Hao Chen, Gang Yu, Chunhua Shen, and Shaojie Shen. Metric3dv2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

  34. [34]

    Composer: Creative and controllable image synthesis with composable conditions

    Lianghua Huang, Di Chen, Yu Liu, Yujun Shen, Deli Zhao, and Jingren Zhou. Composer: Creative and controllable image synthesis with composable conditions. InProceedings of the International Conference on Machine Learning, 2023

  35. [35]

    Multimodal unsupervised image- to-image translation

    Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz. Multimodal unsupervised image- to-image translation. InProceedings of the European Conference on Computer Vision, 2018

  36. [36]

    Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017

  37. [37]

    Oneformer: One transformer to rule universal image segmentation

    Jitesh Jain, Jiachen Li, Mang-Tik Chiu, Ali Hassani, Nikita Orlov, and Humphrey Shi. Oneformer: One transformer to rule universal image segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

  38. [38]

    Rethinking fid: Towards a better evaluation metric for image generation

    Sadeep Jayasumana, Srikumar Ramalingam, Andreas Veit, Daniel Glasner, Ayan Chakrabarti, and Sanjiv Kumar. Rethinking fid: Towards a better evaluation metric for image generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

  39. [39]

    Dginstyle: Domain-generalizable semantic segmentation with image diffusion models and stylized semantic control

    Yuru Jia, Lukas Hoyer, Shengyu Huang, Tianfu Wang, Luc Van Gool, Konrad Schindler, and Anton Obukhov. Dginstyle: Domain-generalizable semantic segmentation with image diffusion models and stylized semantic control. InEuropean Conference on Computer Vision, 2024

  40. [40]

Auto-Encoding Variational Bayes

    Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. International Conference on Learning Representations, 2014

  41. [41]

    Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. InAdvances in Neural Information Processing Systems, 2012

  42. [42]

    Image data augmentation approaches: A comprehensive survey and future directions.arXiv preprint arXiv:2301.02830, 2023

    Tarun Kumar et al. Image data augmentation approaches: A comprehensive survey and future directions.arXiv preprint arXiv:2301.02830, 2023

  43. [43]

    Improved precision and recall metric for assessing generative models

    Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models. InAdvances in Neural Information Processing Systems, 2019

  44. [44]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InProceedings of the International Conference on Machine Learning, 2023

  45. [45]

    Data-centric evolution in autonomous driving: A comprehensive survey of big data system, data mining, and closed-loop technologies.arXiv preprint arXiv:2401.12888, 2024

    Lincan Li, Wei Shao, Wei Dong, Yijun Tian, Qiming Zhang, Kaixiang Yang, and Wenjie Zhang. Data-centric evolution in autonomous driving: A comprehensive survey of big data system, data mining, and closed-loop technologies.arXiv preprint arXiv:2401.12888, 2024

  46. [46]

    Gligen: Open-set grounded text-to-image generation

    Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

  47. [47]

    Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.arXiv preprint arXiv:2304.08485, 2023

  48. [48]

    Unsupervised image-to-image translation networks

    Ming-Yu Liu, Thomas Breuel, and Jan Kautz. Unsupervised image-to-image translation networks. InAdvances in Neural Information Processing Systems, 2017

  49. [49]

    Grounding dino: Marrying dino with grounded pre-training for open-set object detection

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, and Lei Zhang. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. InProceedings of the European Conference on Computer Vision, 2024

  50. [50]

    Fully convolutional networks for semantic segmentation

    Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015

  51. [51]

    Pv-tuning: Beyond straight-through estimation for extreme llm compression.arXiv preprint arXiv:2405.14852, 2024

    Vladimir Malinovskii, Denis Mazur, Ivan Ilin, Denis Kuznedelev, Konstantin Burlachenko, Kai Yi, Dan Alistarh, and Peter Richtarik. Pv-tuning: Beyond straight-through estimation for extreme llm compression.arXiv preprint arXiv:2405.14852, 2024

  52. [52]

    Sdedit: Guided image synthesis and editing with stochastic differential equations.International Conference on Learning Representations, 2022

    Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jue Wang, and Stefano Ermon. Sdedit: Guided image synthesis and editing with stochastic differential equations.International Conference on Learning Representations, 2022

  53. [53]

    T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models.Proceedings of the AAAI Conference on Artificial Intelligence, 2024

    Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, Ying Shan, and Xiaohu Qie. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models.Proceedings of the AAAI Conference on Artificial Intelligence, 2024

  54. [54]

    M"uller and Frank Hutter

    Samuel G. M"uller and Frank Hutter. Trivialaugment: Tuning-free yet state-of-the-art data augmentation. InProceedings of the IEEE/CVF International Conference on Computer Vision, 2021. 40

  55. [55]

    A survey of synthetic data augmentation methods in computer vision.arXiv preprint arXiv:2403.10075, 2024

    Alhassan Mumuni, Fuseini Mumuni, and Nana Kobina Gerrar. A survey of synthetic data augmentation methods in computer vision.arXiv preprint arXiv:2403.10075, 2024

  56. [56]

    The mapillary vistas dataset for semantic understanding of street scenes

    Gerhard Neuhold, Tobias Ollmann, Samuel Rota Bulo, and Peter Kontschieder. The mapillary vistas dataset for semantic understanding of street scenes. InProceedings of the IEEE International Conference on Computer Vision, 2017

  57. [57]

Improved Denoising Diffusion Probabilistic Models

    Alex Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. Proceedings of the International Conference on Machine Learning, 2021

  58. [58]

    Pixelponder: Dynamic patch adaptation for enhanced multi-conditional text-to-image generation.arXiv preprint arXiv:2503.06684, 2025

    Yanjie Pan, Qingdong He, Zhengkai Jiang, Pengcheng Xu, Chaoyi Wang, Jinlong Peng, Haoxuan Wang, Yun Cao, Zhenye Gan, Mingmin Chi, Bo Peng, and Yabiao Wang. Pixelponder: Dynamic patch adaptation for enhanced multi-conditional text-to-image generation.arXiv preprint arXiv:2503.06684, 2025

  59. [59]

    Semantic image synthesis with spatially-adaptive normalization

    Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019

  60. [60]

    Film: Visual reasoning with a general conditioning layer

    Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, and Aaron Courville. Film: Visual reasoning with a general conditioning layer. InProceedings of the AAAI Conference on Artificial Intelligence, 2018

  61. [61]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. Proceedings of Machine Learning Research, 139:8748–8763, 2021

  62. [62]

    Vision transformers for dense prediction

    René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. InProceedings of the IEEE/CVF International Conference on Computer Vision, 2021

  63. [63]

    Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer

    René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(3):1623–1637, 2022

  64. [64]

    Faster r-cnn: Towards real-time object detection with region proposal networks

    Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. InAdvances in Neural Information Processing Systems, 2015

  65. [65]

Playing for Data: Ground Truth from Computer Games

    Stephan R. Richter, Vibhav Vineet, Stefan Roth, and Vladlen Koltun. Playing for data: Ground truth from computer games. InProceedings of the European Conference on Computer Vision, 2016

  66. [66]

    High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

  67. [67]

    U-net: Convolutional networks for biomedical image segmentation

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention, 2015

  68. [68]

    German Ros, Laura Sellart, Joanna Materzynska, David Vázquez, and Antonio M. López. The SYNTHIA dataset: A large collection of synthetic images for semantic segmentation of urban scenes. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016

  69. [69]

Palette: Image-to-Image Diffusion Models

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J. Fleet, and Mohammad Norouzi. Palette: Image-to-image diffusion models.ACM Transactions on Graphics, 41(6), 2022

  70. [70]

LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving

    Hao Sha, Yao Mu, Yuxuan Jiang, Li Chen, Chenfeng Xu, Ping Luo, Shengbo Eben Li, Masayoshi Tomizuka, Wei Zhan, and Mingyu Ding. Languagempc: Large language models as decision makers for autonomous driving.arXiv preprint arXiv:2310.03026, 2023

  71. [71]

A Survey on Image Data Augmentation for Deep Learning

    Connor Shorten and Taghi M. Khoshgoftaar. A survey on image data augmentation for deep learning.Journal of Big Data, 6(1):60, 2019

  72. [72]

    Drivelm: Driving with graph visual question answering

    Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beisswenger, Ping Luo, Andreas Geiger, and Hongyang Li. Drivelm: Driving with graph visual question answering. InProceedings of the European Conference on Computer Vision, 2024

  73. [73]

Deep Unsupervised Learning Using Nonequilibrium Thermodynamics

    Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. Proceedings of the International Conference on Machine Learning, 2015

  74. [74]

    Denoising diffusion implicit models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. International Conference on Learning Representations, 2021

  75. [75]

Score-Based Generative Modeling Through Stochastic Differential Equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. International Conference on Learning Representations, 2021

  76. [76]

    Pixel difference networks for efficient edge detection

    Zhenyu Su, Wenzhe Liu, Sheng Wang, Xiaofei Zhai, and Kui Ren. Pixel difference networks for efficient edge detection. InProceedings of the IEEE/CVF International Conference on Computer Vision, 2021

  77. [77]

    Scalability in perception for autonomous driving: Waymo open dataset

    Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurélien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, Vijay Vasudevan, Wei Han, Jiquan Ngiam, Hang Zhao, Aleksei Timofeev, Scott Ettinger, Maxim Krivokon, Amy Gao, Aditya Joshi, Yu Zhang, Jonathon Shlens, Zhifeng Chen, and Dragomir Anguelov. Scalability in perception...

  78. [78]

    Anycontrol: Create your artwork with versatile control on text-to-image generation.arXiv preprint arXiv:2406.18958, 2024

    Yanan Sun, Yanchen Liu, Yinhao Tang, Wenjie Pei, and Kai Chen. Anycontrol: Create your artwork with versatile control on text-to-image generation.arXiv preprint arXiv:2406.18958, 2024

  79. [79]

    Training deep networks with synthetic data: Bridging the reality gap by domain randomization

Jonathan Tremblay, Thang To, Balakumar Sundaralingam, Yu Xiang, Dieter Fox, and Stan Birchfield. Training deep networks with synthetic data: Bridging the reality gap by domain randomization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2018

  80. [80]

Attention Is All You Need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. InAdvances in Neural Information Processing Systems, 2017

Showing first 80 references.