pith. machine review for the scientific record.

arxiv: 2605.09425 · v1 · submitted 2026-05-10 · 💻 cs.CV · cs.AI

Recognition: no theorem link

AtteConDA: Attention-Based Conflict Suppression in Multi-Condition Diffusion Models and Synthetic Data Augmentation

Shogo Noguchi

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 02:34 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords multi-condition diffusion · conflict suppression · attention mechanism · image generation · semantic segmentation · depth maps · edge conditions · driving scene augmentation

The pith

An attention mechanism in multi-condition diffusion models suppresses conflicts between segmentation, depth, and edge inputs to preserve more scene structure in generated driving images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to solve a practical problem in using diffusion models for data augmentation in autonomous driving: when multiple conditions like semantic segmentation, depth, and edges are applied together, they often conflict and erode the high-level layout of the original scene. The authors input all three extracted signals into a diffusion model and add an attention-based module that identifies and dampens the conflicting parts of those signals during generation. This produces images that stay closer to the original structural details while still varying in appearance. The result is synthetic training data that retains annotations and can be fed into high-level driving models for tasks such as behavior understanding. The work also supplies a generation pipeline and evaluation protocol so later methods can be compared directly on the same driving-scene criteria.
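
To make that flow concrete, the sketch below mimics the condition-extraction step with toy stand-ins (the paper's figures name a learned segmenter, Metric3Dv2 depth, and Canny edges as the real extractors; every function here is an invented placeholder for illustration, not the authors' code).

```python
import numpy as np

# Toy stand-ins for the three structural extractors described above.
# None of this is the authors' pipeline; it only shows the shape of the data
# the multi-condition generator would consume alongside a text prompt.

def toy_segmentation(img: np.ndarray) -> np.ndarray:
    # Stand-in for a learned segmenter: quantize luminance into coarse "classes".
    return np.digitize(img.mean(axis=-1), bins=[64, 128, 192])

def toy_depth(img: np.ndarray) -> np.ndarray:
    # Stand-in for monocular depth: a "closer at the bottom of the frame" ramp.
    h, w, _ = img.shape
    return np.tile(np.linspace(1.0, 0.1, h)[:, None], (1, w))

def toy_edges(img: np.ndarray) -> np.ndarray:
    # Stand-in for Canny: thresholded finite-difference gradient magnitude.
    gray = img.mean(axis=-1)
    gx = np.abs(np.diff(gray, axis=1, append=gray[:, -1:]))
    gy = np.abs(np.diff(gray, axis=0, append=gray[-1:, :]))
    return ((gx + gy) > 20.0).astype(np.float32)

def extract_conditions(img: np.ndarray) -> dict:
    """The three structural signals that pin scene layout during generation."""
    return {"segmentation": toy_segmentation(img),
            "depth": toy_depth(img),
            "edges": toy_edges(img)}

if __name__ == "__main__":
    img = (np.random.rand(256, 512, 3) * 255.0).astype(np.float32)
    print({k: v.shape for k, v in extract_conditions(img).items()})
```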

Core claim

Inputting semantic segmentation, depth, and edge maps extracted from an original driving image into a diffusion model, combined with an attention-based conflict suppression step, produces generated images that retain stronger high-level structural cues than single-condition or unadjusted multi-condition baselines.

What carries the argument

Attention-based conflict suppression that detects inconsistent signals across the segmentation, depth, and edge conditioning inputs and reduces their influence during the diffusion denoising process.
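
A minimal sketch of what such a mechanism could look like, assuming a per-location gate over three condition feature maps with a straight-through hard selector, loosely following the Patch-wise Adaptation Module described in Figure 9. The module below is an illustrative reconstruction, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import List

class ConflictSuppressionGate(nn.Module):
    """Illustrative per-location gate over three condition feature maps.

    Scores each condition (edge, depth, segmentation) at every spatial
    position and hard-selects the locally dominant one, while gradients
    flow through the soft scores (straight-through estimator). This mirrors
    the behaviour described for the paper's module, not its exact code.
    """

    def __init__(self, channels: int):
        super().__init__()
        # A lightweight scoring head shared across the three conditions.
        self.score = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, cond_feats: List[torch.Tensor]) -> torch.Tensor:
        # cond_feats: three tensors of shape (B, C, H, W).
        scores = torch.cat([self.score(f) for f in cond_feats], dim=1)   # (B, 3, H, W)
        soft = F.softmax(scores, dim=1)                                  # soft weights
        # Hard one-hot choice of the locally dominant condition...
        hard = F.one_hot(soft.argmax(dim=1), num_classes=len(cond_feats))
        hard = hard.permute(0, 3, 1, 2).to(soft.dtype)                   # (B, 3, H, W)
        # ...kept trainable with the straight-through trick.
        gate = hard + soft - soft.detach()
        stacked = torch.stack(cond_feats, dim=1)                         # (B, 3, C, H, W)
        return (gate.unsqueeze(2) * stacked).sum(dim=1)                  # (B, C, H, W)

if __name__ == "__main__":
    # Example: three 64-channel condition feature maps on a 32x64 grid.
    feats = [torch.randn(2, 64, 32, 64) for _ in range(3)]
    fused = ConflictSuppressionGate(channels=64)(feats)
    print(fused.shape)  # torch.Size([2, 64, 32, 64])
```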

If this is right

  • Generated images keep more detailed structural information, making them suitable for augmenting data used in traffic-rule extraction and driving-behavior models.
  • Annotations from the original images remain usable on the synthetic outputs, allowing direct improvement of recognition performance without relabeling.
  • A standardized generation framework and evaluation protocol now exists for measuring structural fidelity in multi-condition driving-scene generation.
  • Condition conflicts are treated as a solvable modeling problem rather than an inherent limit of multi-condition diffusion.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same attention suppression pattern could be tested on other sets of conflicting conditions, such as pose and depth in human-image generation.
  • If the method scales without extra compute cost, it may lower the amount of real-world driving footage needed to train perception systems.
  • Extending the conflict detector to handle temporal consistency across video frames would be a direct next step for video-based driving augmentation.

Load-bearing premise

That an attention mechanism can reliably detect and suppress conflicts among semantic segmentation, depth, and edge conditions without introducing new artifacts or reducing overall image fidelity in driving scenes.

What would settle it

Quantitative structure-preservation scores and downstream-task performance on the authors' proposed evaluation protocol would settle it: the claim fails if adding the attention-suppression module to a plain multi-condition diffusion baseline yields no statistically significant gain, or introduces new artifacts.
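
For orientation, the snippet below sketches three of the structure metrics named in the figures (semantic mIoU, depth RMSE, masked edge L1) in the common-projection spirit of Figure 10, where original-side outputs serve as pseudo ground truth. Shapes, class counts, and thresholds are assumptions, not the paper's settings.

```python
import numpy as np

def miou(seg_orig: np.ndarray, seg_gen: np.ndarray, num_classes: int) -> float:
    """Mean IoU between projected segmentations (original side = pseudo GT)."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(seg_orig == c, seg_gen == c).sum()
        union = np.logical_or(seg_orig == c, seg_gen == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0

def depth_rmse(depth_orig: np.ndarray, depth_gen: np.ndarray) -> float:
    """Root-mean-square error between projected depth maps (lower is better)."""
    return float(np.sqrt(np.mean((depth_orig - depth_gen) ** 2)))

def masked_edge_l1(edge_orig: np.ndarray, edge_gen: np.ndarray) -> float:
    """L1 error on edge maps, restricted to locations where either map fires."""
    mask = (edge_orig > 0) | (edge_gen > 0)
    return float(np.abs(edge_orig - edge_gen)[mask].mean()) if mask.any() else 0.0

if __name__ == "__main__":
    # Toy example on random maps, just to show the interfaces.
    rng = np.random.default_rng(0)
    seg_o = rng.integers(0, 19, (128, 256))
    seg_g = rng.integers(0, 19, (128, 256))
    d_o, d_g = rng.random((128, 256)), rng.random((128, 256))
    e_o = (rng.random((128, 256)) > 0.9).astype(np.float32)
    e_g = (rng.random((128, 256)) > 0.9).astype(np.float32)
    print(miou(seg_o, seg_g, 19), depth_rmse(d_o, d_g), masked_edge_l1(e_o, e_g))
```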

Figures

Figures reproduced from arXiv: 2605.09425 by Shogo Noguchi.

Figure 1: Conceptual comparison between conventional image-engineering augmentation and generative augmentation. Conventional operations such as mirroring, rotation, noise, and clipping increase sample count but do not create semantic changes in weather or time. Generative augmentation can change appearance conditions such as fog, rain, snow, and night while preserving the original scene structure.
Figure 2: Example of the limitation of semantic-mask-only generation, motivated by DGInStyle-style augmentation [39]. If the semantic mask is the only structural condition, details not explicitly constrained by the mask can become weak, and artifacts can appear around road boundaries, vehicles, and other fine structures.
Figure 3: Condition conflict in multi-condition generation. Depth provides low-frequency spatial and distance constraints, while RGB or edge constraints emphasize higher-frequency texture and contour information. When these conditions are injected together without conflict handling, model guidance can become inconsistent and produce distorted or semantically mismatched results.
Figure 4: Comparison of monocular depth maps. ZoeDepth can saturate in distant regions, whereas Metric3Dv2 preserves smoother distant gradients and clearer road-depth structure in this example. Since this work evaluates whether the generated image preserves the original perspective and relative geometry, stable relative depth is important.
Figure 5: Comparison of Canny and HED-style edge extraction. Canny preserves fine sign contours and thin structures more explicitly in this example, which is useful when the goal is structural preservation for driving scenes.
Figure 6: Overall multi-condition generation pipeline. The input RGB image is converted into weather/time estimates, semantic segmentation, depth, and edges. A prompt is generated to change appearance, and the prompt plus local conditions are passed to the multi-condition generator.
Figure 7: Prompt-generation pipeline. CLIP estimates the source weather and time subgroup. Semantic segmentation supplies object names. Qwen3-VL generates a caption without explicit weather/time words. A style dictionary then adds target weather/time adjectives and decorations to build the final prompt.
Figure 8: Model detail. The feature extractor is initialized from Uni-ControlNet local-control weights and processes three structural conditions. Stable Diffusion, the VAE, and the text encoder are frozen, so training mainly adapts the local control branch.
Figure 9: Patch-wise Adaptation Module. Edge, depth, and semantic segmentation are processed by condition-specific stems. At each local feature-grid position, a tri-attention-style gate scores the three conditions, and a straight-through hard selector chooses the locally effective condition before multi-stage control injection. The tri-attention block is not used to continuously fuse all features.
Figure 10: Common-projection structure evaluation. The original image and generated image are passed through the same structural projectors. Original-side outputs are treated as pseudo ground truth, and generated-side outputs are compared in semantic, depth, edge, and object spaces rather than in raw RGB pixel space. The metrics are semantic segmentation mIoU, depth RMSE, masked edge L1 error, and traffic-object preservation.
Figure 11: Intuition for realism evaluation. Low-realism images may preserve layout but show unnatural texture, color, or lighting. Higher-realism images show more plausible light, shadow, and atmosphere. CLIP-CMMD is smaller for the latter.
Figure 12: Intuition for diversity evaluation. Similar generated image pairs have small distances, while pairs with substantially different weather, road condition, and illumination have larger distances. LPIPS and 1 − MS-SSIM quantify this separation.
Figure 13: Intuition for text-alignment evaluation. The correct prompt and 99 mismatched prompts are ranked by CLIP similarity to the generated image. R-Precision is high when the correct prompt appears near the top.
Figure 14: Qualitative comparison of training with and without Uni-ControlNet initialization. Without pretraining, road structure and vehicle geometry can collapse or hallucinate. With pretraining, the generated image better preserves road shape and object placement.
Figure 15: Scaling behavior of structure-related metrics for Tune models. Depth RMSE and edge L1 are lower-is-better; semantic mIoU and object F1 are higher-is-better. Most improvements are largest between 0 and 30K, become milder from 30K to 60K, and approach saturation near 90K.
Figure 16: Scaling behavior of quality, diversity, and text-alignment metrics for Tune models. CLIP-CMMD is lower-is-better, while R-Precision@5 and LPIPS are higher-is-better. Realism improves until 60K and slightly reverses at 90K.
Figure 17: Qualitative comparison across training steps. Prior-work baselines can show unrealistic style or structural changes, while the Tune models better preserve the original road structure and realism as training proceeds.
Figure 18: Zoomed local artifact that can remain even when training steps are increased. A vehicle in the original can disappear or change into another structure, motivating local condition-conflict suppression.
Figure 19: Qualitative comparison around PAM. The models change the original appearance toward a snowy road condition, but distant structural consistency differs.
Figure 20: Zoomed example of distant structure preservation. Tune60K weakens road continuity and distorts distant trees, while PAM60K preserves the distant structure more consistently.
read the original abstract

Recent conditional image generation methods can improve controllability by generating images that are faithful to conditions such as sketches, human poses, segmentation maps, and depth. By applying these techniques to image augmentation while preserving annotations, generated images can be used as additional training data and can improve recognition performance. However, for high-level driving tasks such as traffic-rule extraction and driving-behavior understanding, simply using annotations as conditions is insufficient. Instead, images must be augmented while preserving the detailed high-level structure of the original scene. One possible solution is to use multiple conditions so that generated images retain diverse structural cues after generation. However, when multiple conditions are used, conflicts among conditions can prevent reliable structure preservation. In this work, we input semantic segmentation, depth, and edges extracted from the original image into a multi-condition image generation model, thereby providing rich structural information as conditions. We further propose a modeling approach for handling conflicts among multiple conditions and show that it enables image generation with stronger structural preservation. We also build a generation framework and evaluation protocol for driving tasks, establishing a basis for comparison with prior and future models. As a result, this work contributes to image generation research by addressing condition conflicts in multi-condition generation and provides an important step toward mitigating data scarcity in high-level autonomous-driving tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces AtteConDA, an attention-based conflict suppression module inserted into a multi-condition diffusion pipeline. It takes semantic segmentation, depth, and edge maps extracted from the same source image as conditions, with the goal of generating augmented images that preserve detailed high-level structure in driving scenes better than naive multi-conditioning. The authors also describe a generation framework and evaluation protocol for driving tasks and position the work as addressing data scarcity for high-level autonomous-driving applications such as traffic-rule extraction.

Significance. If the attention mechanism demonstrably reduces condition conflicts while maintaining image fidelity and annotation consistency, the approach could provide a practical tool for synthetic data augmentation in computer vision for autonomous driving. The framing of conflicts as spatially incompatible signals and the introduction of a dedicated evaluation protocol for driving scenes are constructive steps that could serve as a basis for future comparisons, provided quantitative validation is supplied.

major comments (1)
  1. Abstract: the central claim that the proposed attention-based modeling approach 'enables image generation with stronger structural preservation' is asserted without any quantitative metrics, baseline comparisons, ablation studies, or error analysis. The evaluation protocol for driving tasks is mentioned but not described, leaving the load-bearing empirical support for the claim unaddressed.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the concern regarding the abstract and the supporting empirical evidence below, and we have revised the manuscript to strengthen the presentation of our results.

read point-by-point responses
  1. Referee: Abstract: the central claim that the proposed attention-based modeling approach 'enables image generation with stronger structural preservation' is asserted without any quantitative metrics, baseline comparisons, ablation studies, or error analysis. The evaluation protocol for driving tasks is mentioned but not described, leaving the load-bearing empirical support for the claim unaddressed.

    Authors: We agree that the abstract, as a concise summary, does not itself contain the quantitative details or protocol description, which leaves the central claim insufficiently supported within that section alone. The body of the manuscript reports the relevant experiments, including quantitative metrics for structural preservation (e.g., consistency with input annotations), baseline comparisons against single-condition and naive multi-condition diffusion models, ablations isolating the attention-based conflict suppression component, and error analysis on generated driving scenes. The evaluation protocol is described in the dedicated section on the generation framework and driving-task metrics. To directly address the referee's point, we have revised the abstract to incorporate a brief summary of the key quantitative improvements and a short description of the evaluation protocol, ensuring the empirical support is referenced at the point where the claim is made.

    revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The manuscript introduces an attention-based conflict suppression module as a novel modeling addition to multi-condition diffusion pipelines for driving scene augmentation. No equations, fitted parameters, or predictions are shown that reduce by construction to prior inputs or self-citations. The central claim—that the module enables stronger structural preservation—follows from the explicit construction of the module itself rather than from any re-expression of fitted quantities or load-bearing self-citations. The derivation chain is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the method appears to rest on standard diffusion-model assumptions plus the unstated premise that attention can resolve condition conflicts.

pith-pipeline@v0.9.0 · 5525 in / 1109 out tokens · 50429 ms · 2026-05-12T02:34:30.080762+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

105 extracted references · 105 canonical work pages · 7 internal anchors

  1. [1]

MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation

    Omer Bar-Tal, Hila Manor, Kevin Y. Li, Shai Dekel, Omri Fried, Idan Rubinstein, Michael Elad, and Lior Wolf. Multidiffusion: Fusing diffusion paths for controlled image generation. InProceedings of the International Conference on Machine Learning, 2023

  2. [2]

    Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013

  3. [3]

    ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth

    Shariq Farooq Bhat, Ibraheem Alhashim, and Peter Wonka. Zoedepth: Zero-shot transfer by combining relative and metric depth.arXiv preprint arXiv:2302.12288, 2023

  4. [4]

Demystifying MMD GANs

    Mikołaj Bińkowski, Danica J. Sutherland, Michael Arbel, and Arthur Gretton. Demystifying mmd gans. InInternational Conference on Learning Representations, 2018

  5. [5]

    Tim Brooks, Aleksander Holynski, and Alexei A. Efros. Instructpix2pix: Learning to follow image editing instructions. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

  6. [6]

nuScenes: A Multimodal Dataset for Autonomous Driving

    Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020

  7. [7]

A Computational Approach to Edge Detection

    John Canny. A computational approach to edge detection.IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(6):679–698, 1986

  8. [8]

    End-to-end object detection with transformers

    Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. InProceedings of the European Conference on Computer Vision, 2020

  9. [9]

    Driving by the rules: A benchmark for integrating traffic sign regulations into vectorized hd map

    Xinyuan Chang, Maixuan Xue, Xinran Liu, Zheng Pan, and Xing Wei. Driving by the rules: A benchmark for integrating traffic sign regulations into vectorized hd map. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

  10. [10]

    Encoder-decoder with atrous separable convolution for semantic image segmentation

    Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision, 2018

  11. [11]

Masked-Attention Mask Transformer for Universal Image Segmentation

    Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

  12. [12]

    The cityscapes dataset for semantic urban scene understanding

Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016

  13. [13]

AutoAugment: Learning Augmentation Strategies from Data

    Ekin D. Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V. Le. Au- toaugment: Learning augmentation strategies from data. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019

  14. [14]

RandAugment: Practical Automated Data Augmentation with a Reduced Search Space

    Ekin D. Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V. Le. Randaugment: Practical automated data augmentation with a reduced search space. InAdvances in Neural Information Processing Systems, 2020

  15. [15]

    Talk2car: Taking control of your self-driving car

    Thierry Deruyttere, Simon Vandenhende, Davy Neven, Marc Proesmans, and Luc Van Gool. Talk2car: Taking control of your self-driving car. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, 2019

  16. [16]

    Terrance DeVries and Graham W. Taylor. Improved regularization of convolutional neural networks with cutout.arXiv preprint arXiv:1708.04552, 2017

  17. [17]

    Carla: An open urban driving simulator

    Alexey Dosovitskiy, Germán Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. Carla: An open urban driving simulator. InProceedings of the Conference on Robot Learning, 2017

  18. [18]

    Depth map prediction from a single image using a multi-scale deep network

    David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. InAdvances in Neural Information Processing Systems, 2014

  19. [19]

Silq: Simple Large Language Model Quantization-Aware Training

    Steven K. Esser, Jeffrey L. McKinstry, Deepika Bablani, Rathinakumar Appuswamy, and Dharmendra S. Modha. Silq: Simple large language model quantization-aware training.arXiv preprint arXiv:2507.16933, 2025

  20. [20]

    Are we ready for autonomous driving? the kitti vision benchmark suite

    Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2012

  21. [21]

    Clément Godard, Oisin Mac Aodha, Michael Firman, and Gabriel J. Brostow. Digging into self-supervised monocular depth estimation. InProceedings of the IEEE/CVF International Conference on Computer Vision, 2019

  22. [22]

    Syndiff-ad: Improving semantic segmentation and end-to-end autonomous driving with synthetic data from latent diffusion models.arXiv preprint arXiv:2411.16776, 2024

    Harsh Goel, Sai Shankar Narasimhan, Oguzhan Akcin, and Sandeep Chinchali. Syndiff-ad: Improving semantic segmentation and end-to-end autonomous driving with synthetic data from latent diffusion models.arXiv preprint arXiv:2411.16776, 2024

  23. [23]

    Generative adversarial nets

    Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. InAdvances in Neural Information Processing Systems, 2014

  24. [24]

A Kernel Two-Sample Test

    Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, volume 13, pages 723–773, 2012

  25. [25]

    Visual traffic knowledge graph generation from scene images

    Yunfei Guo, Fei Yin, Xiao hui Li, Xudong Yan, Tao Xue, Shuqi Mei, and Cheng-Lin Liu. Visual traffic knowledge graph generation from scene images. InProceedings of the IEEE/CVF International Conference on Computer Vision, 2023

  26. [26]

    Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016

  27. [27]

AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty

    Dan Hendrycks, Norman Mu, Ekin D. Cubuk, Barret Zoph, Justin Gilmer, and Balaji Lakshminarayanan. Augmix: A simple data processing method to improve robustness and uncertainty. InInternational Conference on Learning Representations, 2020

  28. [28]

    Prompt-to-prompt image editing with cross attention control.International Conference on Learning Representations, 2023

    Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control.International Conference on Learning Representations, 2023

  29. [29]

    Clipscore: A reference-free evaluation metric for image captioning

    Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021

  30. [30]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. InAdvances in Neural Information Processing Systems, 2017

  31. [31]

    Denoising diffusion probabilistic models.Advances in Neural Information Processing Systems, 2020

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in Neural Information Processing Systems, 2020

  32. [32]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022

  33. [33]

    Mu Hu, Wei Yin, Chi Zhang, Zhipeng Cai, Xiaoxiao Long, Kaixuan Wang, Hao Chen, Gang Yu, Chunhua Shen, and Shaojie Shen. Metric3dv2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

  34. [34]

    Composer: Creative and controllable image synthesis with composable conditions

    Lianghua Huang, Di Chen, Yu Liu, Yujun Shen, Deli Zhao, and Jingren Zhou. Composer: Creative and controllable image synthesis with composable conditions. InProceedings of the International Conference on Machine Learning, 2023

  35. [35]

    Multimodal unsupervised image- to-image translation

    Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz. Multimodal unsupervised image- to-image translation. InProceedings of the European Conference on Computer Vision, 2018

  36. [36]

    Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017

  37. [37]

    Oneformer: One transformer to rule universal image segmentation

    Jitesh Jain, Jiachen Li, Mang-Tik Chiu, Ali Hassani, Nikita Orlov, and Humphrey Shi. Oneformer: One transformer to rule universal image segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

  38. [38]

    Rethinking fid: Towards a better evaluation metric for image generation

    Sadeep Jayasumana, Srikumar Ramalingam, Andreas Veit, Daniel Glasner, Ayan Chakrabarti, and Sanjiv Kumar. Rethinking fid: Towards a better evaluation metric for image generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

  39. [39]

    Dginstyle: Domain-generalizable semantic segmentation with image diffusion models and stylized semantic control

    Yuru Jia, Lukas Hoyer, Shengyu Huang, Tianfu Wang, Luc Van Gool, Konrad Schindler, and Anton Obukhov. Dginstyle: Domain-generalizable semantic segmentation with image diffusion models and stylized semantic control. InEuropean Conference on Computer Vision, 2024

  40. [40]

Auto-Encoding Variational Bayes

    Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. International Conference on Learning Representations, 2014

  41. [41]

    Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. InAdvances in Neural Information Processing Systems, 2012

  42. [42]

    Image data augmentation approaches: A comprehensive survey and future directions.arXiv preprint arXiv:2301.02830, 2023

    Tarun Kumar et al. Image data augmentation approaches: A comprehensive survey and future directions.arXiv preprint arXiv:2301.02830, 2023

  43. [43]

    Improved precision and recall metric for assessing generative models

    Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models. InAdvances in Neural Information Processing Systems, 2019

  44. [44]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InProceedings of the International Conference on Machine Learning, 2023

  45. [45]

    Data-centric evolution in autonomous driving: A comprehensive survey of big data system, data mining, and closed-loop technologies.arXiv preprint arXiv:2401.12888, 2024

    Lincan Li, Wei Shao, Wei Dong, Yijun Tian, Qiming Zhang, Kaixiang Yang, and Wenjie Zhang. Data-centric evolution in autonomous driving: A comprehensive survey of big data system, data mining, and closed-loop technologies.arXiv preprint arXiv:2401.12888, 2024

  46. [46]

    Gligen: Open-set grounded text-to-image generation

    Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

  47. [47]

    Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.arXiv preprint arXiv:2304.08485, 2023

  48. [48]

    Unsupervised image-to-image translation networks

    Ming-Yu Liu, Thomas Breuel, and Jan Kautz. Unsupervised image-to-image translation networks. InAdvances in Neural Information Processing Systems, 2017

  49. [49]

    Grounding dino: Marrying dino with grounded pre-training for open-set object detection

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, and Lei Zhang. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. InProceedings of the European Conference on Computer Vision, 2024

  50. [50]

    Fully convolutional networks for semantic segmentation

    Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015

  51. [51]

    Pv-tuning: Beyond straight-through estimation for extreme llm compression.arXiv preprint arXiv:2405.14852, 2024

    Vladimir Malinovskii, Denis Mazur, Ivan Ilin, Denis Kuznedelev, Konstantin Burlachenko, Kai Yi, Dan Alistarh, and Peter Richtarik. Pv-tuning: Beyond straight-through estimation for extreme llm compression.arXiv preprint arXiv:2405.14852, 2024

  52. [52]

    Sdedit: Guided image synthesis and editing with stochastic differential equations.International Conference on Learning Representations, 2022

    Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jue Wang, and Stefano Ermon. Sdedit: Guided image synthesis and editing with stochastic differential equations.International Conference on Learning Representations, 2022

  53. [53]

    T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models.Proceedings of the AAAI Conference on Artificial Intelligence, 2024

    Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, Ying Shan, and Xiaohu Qie. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models.Proceedings of the AAAI Conference on Artificial Intelligence, 2024

  54. [54]

    M"uller and Frank Hutter

    Samuel G. M"uller and Frank Hutter. Trivialaugment: Tuning-free yet state-of-the-art data augmentation. InProceedings of the IEEE/CVF International Conference on Computer Vision, 2021. 40

  55. [55]

    A survey of synthetic data augmentation methods in computer vision.arXiv preprint arXiv:2403.10075, 2024

    Alhassan Mumuni, Fuseini Mumuni, and Nana Kobina Gerrar. A survey of synthetic data augmentation methods in computer vision.arXiv preprint arXiv:2403.10075, 2024

  56. [56]

    The mapillary vistas dataset for semantic understanding of street scenes

    Gerhard Neuhold, Tobias Ollmann, Samuel Rota Bulo, and Peter Kontschieder. The mapillary vistas dataset for semantic understanding of street scenes. InProceedings of the IEEE International Conference on Computer Vision, 2017

  57. [57]

Improved Denoising Diffusion Probabilistic Models

    Alex Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. Proceedings of the International Conference on Machine Learning, 2021

  58. [58]

    Pixelponder: Dynamic patch adaptation for enhanced multi-conditional text-to-image generation.arXiv preprint arXiv:2503.06684, 2025

    Yanjie Pan, Qingdong He, Zhengkai Jiang, Pengcheng Xu, Chaoyi Wang, Jinlong Peng, Haoxuan Wang, Yun Cao, Zhenye Gan, Mingmin Chi, Bo Peng, and Yabiao Wang. Pixelponder: Dynamic patch adaptation for enhanced multi-conditional text-to-image generation.arXiv preprint arXiv:2503.06684, 2025

  59. [59]

    Semantic image synthesis with spatially-adaptive normalization

    Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019

  60. [60]

    Film: Visual reasoning with a general conditioning layer

    Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, and Aaron Courville. Film: Visual reasoning with a general conditioning layer. InProceedings of the AAAI Conference on Artificial Intelligence, 2018

  61. [61]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. Proceedings of Machine Learning Research, 139:8748–8763, 2021

  62. [62]

    Vision transformers for dense prediction

    René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. InProceedings of the IEEE/CVF International Conference on Computer Vision, 2021

  63. [63]

    Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer

    René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(3):1623–1637, 2022

  64. [64]

    Faster r-cnn: Towards real-time object detection with region proposal networks

    Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. InAdvances in Neural Information Processing Systems, 2015

  65. [65]

Playing for Data: Ground Truth from Computer Games

    Stephan R. Richter, Vibhav Vineet, Stefan Roth, and Vladlen Koltun. Playing for data: Ground truth from computer games. InProceedings of the European Conference on Computer Vision, 2016

  66. [66]

    High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

  67. [67]

    U-net: Convolutional networks for biomedical image segmentation

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention, 2015

  68. [68]

    German Ros, Laura Sellart, Joanna Materzynska, David Vázquez, and Antonio M. López. The SYNTHIA dataset: A large collection of synthetic images for semantic segmentation of urban scenes. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016

  69. [69]

Palette: Image-to-Image Diffusion Models

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J. Fleet, and Mohammad Norouzi. Palette: Image-to-image diffusion models.ACM Transactions on Graphics, 41(6), 2022

  70. [70]

LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving

    Hao Sha, Yao Mu, Yuxuan Jiang, Li Chen, Chenfeng Xu, Ping Luo, Shengbo Eben Li, Masayoshi Tomizuka, Wei Zhan, and Mingyu Ding. Languagempc: Large language models as decision makers for autonomous driving.arXiv preprint arXiv:2310.03026, 2023

  71. [71]

A Survey on Image Data Augmentation for Deep Learning

    Connor Shorten and Taghi M. Khoshgoftaar. A survey on image data augmentation for deep learning.Journal of Big Data, 6(1):60, 2019

  72. [72]

    Drivelm: Driving with graph visual question answering

    Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beisswenger, Ping Luo, Andreas Geiger, and Hongyang Li. Drivelm: Driving with graph visual question answering. InProceedings of the European Conference on Computer Vision, 2024

  73. [73]

Deep Unsupervised Learning Using Nonequilibrium Thermodynamics

    Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. Proceedings of the International Conference on Machine Learning, 2015

  74. [74]

    Denoising diffusion implicit models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. International Conference on Learning Representations, 2021

  75. [75]

Score-Based Generative Modeling Through Stochastic Differential Equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. International Conference on Learning Representations, 2021

  76. [76]

    Pixel difference networks for efficient edge detection

    Zhenyu Su, Wenzhe Liu, Sheng Wang, Xiaofei Zhai, and Kui Ren. Pixel difference networks for efficient edge detection. InProceedings of the IEEE/CVF International Conference on Computer Vision, 2021

  77. [77]

    Scalability in perception for autonomous driving: Waymo open dataset

    Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurélien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, Vijay Vasudevan, Wei Han, Jiquan Ngiam, Hang Zhao, Aleksei Timofeev, Scott Ettinger, Maxim Krivokon, Amy Gao, Aditya Joshi, Yu Zhang, Jonathon Shlens, Zhifeng Chen, and Dragomir Anguelov. Scalability in perception...

  78. [78]

    Anycontrol: Create your artwork with versatile control on text-to-image generation.arXiv preprint arXiv:2406.18958, 2024

    Yanan Sun, Yanchen Liu, Yinhao Tang, Wenjie Pei, and Kai Chen. Anycontrol: Create your artwork with versatile control on text-to-image generation.arXiv preprint arXiv:2406.18958, 2024

  79. [79]

    Training deep networks with synthetic data: Bridging the reality gap by domain randomization

Jonathan Tremblay, Thang To, Balakumar Sundaralingam, Yu Xiang, Dieter Fox, and Stan Birchfield. Training deep networks with synthetic data: Bridging the reality gap by domain randomization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2018

  80. [80]

Attention Is All You Need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. InAdvances in Neural Information Processing Systems, 2017

Showing first 80 references.