arxiv: 2604.20354 · v1 · submitted 2026-04-22 · 💻 cs.CV

Recognition: unknown

Hallucination Early Detection in Diffusion Models

Federico Betti , Lorenzo Baraldi , Rita Cucchiara , Nicu Sebe

Authors on Pith no claims yet

Pith reviewed 2026-05-10 00:52 UTC · model grok-4.3

classification 💻 cs.CV

keywords diffusion modelstext-to-image generationhallucination detectionearly stoppingcross-attention mapsmulti-object promptsimage synthesis efficiency

0 comments

The pith

HEaD+ spots likely object omissions in diffusion generations at an intermediate step and restarts with a new seed to raise the rate of complete multi-object images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a method to check early in the diffusion process whether a generation will include every object named in the prompt. It feeds cross-attention maps, text embeddings, and a newly defined Predicted Final Image into a classifier that decides whether to finish the current run or restart with a fresh seed. The classifier is trained on a dataset of 45,000 images generated from prompts containing up to seven objects. When the early check is added to standard diffusion pipelines, the fraction of images that contain all requested objects rises while the average time to reach a complete result falls. The same framework also supplies an optional localization head that predicts object centers and verifies requested spatial relations before generation completes.

Core claim

Cross-attention maps and the Predicted Final Image at an intermediate timestep can be combined to forecast whether every object specified in the prompt will appear in the final output; if the forecast is negative the generation is aborted and restarted with a different seed, producing a 6-8 percent higher rate of complete four-object images and up to 32 percent lower wall-clock time than repeated full runs without early detection.

What carries the argument

HEaD+, a classifier trained to output a continue-or-restart decision from the triplet of cross-attention maps, textual prompt embeddings, and the Predicted Final Image constructed at an intermediate diffusion timestep.

If this is right

The probability of obtaining a complete four-object image rises by 6-8 percent when the early check is used alongside existing diffusion pipelines.
Average time to produce a complete image drops by up to 32 percent because flawed generations are terminated early.
An auxiliary localization module can be run at the same intermediate step to predict object centroids and gate generation on requested spatial relations.
The approach works with any diffusion backbone and requires only a modest additional classifier trained once on the InsideGen collection.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Restarting diffusion trajectories early may prove cheaper than post-hoc correction or latent-space optimization for multi-object prompts.
The same intermediate-state predictor could be extended to other generative tasks that suffer from partial omissions, such as scene layout or video synthesis.
Seed selection strategies that exploit early signals may reduce the energy cost of large-scale text-to-image services without changing model weights.

Load-bearing premise

Cross-attention maps together with the Predicted Final Image at an intermediate step are reliable enough to predict whether all prompt objects will be present in the finished image, and a new seed will usually succeed without requiring many extra attempts.

What would settle it

A large-scale test in which HEaD+ decisions are recorded but generations are always completed anyway; if the fraction of complete images among the runs that HEaD+ would have restarted is no higher than among the runs it would have kept, the early-detection premise is falsified.

read the original abstract

Text-to-Image generation has seen significant advancements in output realism with the advent of diffusion models. However, diffusion models encounter difficulties when tasked with generating multiple objects, frequently resulting in hallucinations where certain entities are omitted. While existing solutions typically focus on optimizing latent representations within diffusion models, the relevance of the initial generation seed is typically underestimated. While using various seeds in multiple iterations can improve results, this method also significantly increases time and energy costs. To address this challenge, we introduce HEaD+ (Hallucination Early Detection +), a novel approach designed to identify incorrect generations early in the diffusion process. The HEaD+ framework integrates cross-attention maps and textual information with a novel input, the Predicted Final Image. The objective is to assess whether to proceed with the current generation or restart it with a different seed, thereby exploring multiple-generation seeds while conserving time. HEaD+ is trained on the newly created InsideGen dataset of 45,000 generated images, each containing prompts with up to seven objects. Our findings demonstrate a 6-8% increase in the likelihood of achieving a complete generation (i.e., an image accurately representing all specified subjects) with four objects when applying HEaD+ alongside existing models. Additionally, HEaD+ reduces generation times by up to 32% when aiming for a complete image, enhancing the efficiency of generating complete and accurate object representations relative to leading models. Moreover, we propose an integrated localization module that predicts object centroid positions and verifies pairwise spatial relations (if requested by the users) at an intermediate timestep, gating generation together with object presence to further improve relation-consistent outcomes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

HEaD+ gives a practical early-detection trick for multi-object hallucinations in diffusion by adding a Predicted Final Image input, with plausible efficiency claims that still need tighter experimental backing.

read the letter

Hey, the core idea here is that HEaD+ spots likely missing objects partway through diffusion by feeding cross-attention maps, text embeddings, and a new Predicted Final Image into a detector, then restarts with a fresh seed if the outlook looks bad. This targets the real issue that multi-object prompts often drop subjects without anyone noticing until the end. The new pieces are that Predicted Final Image input, the InsideGen dataset of 45k examples, and the localization module that also gates on predicted positions and relations. The paper does a clean job framing seed choice as an underused lever and showing how early intervention can sit on top of existing models without retraining them. The reported 6-8% lift in complete four-object generations and 32% time cut are the kind of numbers that would matter in practice if they hold. The softer part is the evaluation: the abstract states the gains but leaves the protocol, baselines, success metric details, and InsideGen construction/split unspecified, so it is difficult to judge whether the detector precision is high enough to make restarts pay off on average or whether the results generalize beyond their setup. The assumption that a new seed will usually succeed without many extra tries also feels optimistic if the underlying model has consistent biases on certain object combinations. This is aimed at people building or deploying text-to-image systems where prompt fidelity and compute cost both matter, such as design tools or research pipelines. A reader working on reliability tweaks for diffusion would get concrete value from the framework even if the numbers need more scrutiny. I would send it for peer review; the approach is logically grounded and addresses a common limitation with a testable addition, so referees could usefully tighten the experiments.

Referee Report

2 major / 2 minor

Summary. The paper introduces HEaD+, a detector for early identification of hallucinations (missing objects) during the diffusion process in text-to-image models. It combines cross-attention maps, text embeddings, and a novel Predicted Final Image input, trained on the new InsideGen dataset of 45,000 images with prompts containing up to seven objects. The method decides at intermediate timesteps whether to continue the current generation or restart with a new seed. Claims include a 6-8% increase in the rate of complete generations (all prompt objects present) for four-object prompts when used with existing models, up to 32% reduction in generation time for complete images, and an optional localization module that predicts object centroids and checks spatial relations.

Significance. If the quantitative claims hold under rigorous evaluation, the work could offer a practical efficiency gain for multi-object text-to-image generation by avoiding full computation on doomed seeds, which is relevant for applications requiring reliable object presence. The InsideGen dataset may serve as a resource for future hallucination studies. However, the absence of detailed evaluation protocols, baselines, and statistical analysis in the current manuscript limits assessment of whether these gains are robust or generalizable beyond the reported figures.

major comments (2)

[Abstract and §4] Abstract and §4 (Evaluation): The central claims of a 6-8% increase in complete generations for four-object prompts and up to 32% time reduction are unsupported because the manuscript provides no evaluation protocol, baseline comparisons (e.g., against naive multi-seed sampling or other hallucination mitigation methods), statistical tests, or details on InsideGen construction, train/test splits, or how success (object presence) is measured. This directly affects the load-bearing quantitative results.
[§3 and §4] §3 (Method) and §4: The approach relies on the assumption that cross-attention maps plus the Predicted Final Image at an intermediate timestep can reliably predict final object presence, and that a restart with a new seed will succeed with low expected attempts. No ablation or analysis quantifies the precision/recall of this early decision or the average number of restarts needed, leaving the claimed time savings and success-rate gains unverified.

minor comments (2)

[§3] The description of the localization module (object centroids and pairwise relations) is brief; clarify whether it is trained jointly or separately and how it gates generation.
[§3] Notation for the Predicted Final Image and its integration with cross-attention should be formalized with equations for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas where additional detail and analysis will strengthen the presentation of our results. We address each major comment below and commit to a major revision that incorporates the requested clarifications, protocols, and analyses.

read point-by-point responses

Referee: [Abstract and §4] Abstract and §4 (Evaluation): The central claims of a 6-8% increase in complete generations for four-object prompts and up to 32% time reduction are unsupported because the manuscript provides no evaluation protocol, baseline comparisons (e.g., against naive multi-seed sampling or other hallucination mitigation methods), statistical tests, or details on InsideGen construction, train/test splits, or how success (object presence) is measured. This directly affects the load-bearing quantitative results.

Authors: We agree that the current manuscript omits several key details needed to fully substantiate the reported gains. In the revised version we will expand §4 with a complete evaluation protocol section. This will specify: (i) the exact procedure for measuring object presence (automated detection via a fine-tuned YOLO model cross-validated against human annotations on a held-out subset), (ii) the full construction pipeline for InsideGen (prompt template sampling, diffusion model used for synthesis, filtering criteria), (iii) the 80/20 train/test split and any stratification by object count, and (iv) direct comparisons against naive multi-seed sampling (fixed budget of 3–5 seeds) as well as representative hallucination-mitigation baselines from the literature. We will also report statistical significance via bootstrap confidence intervals and paired tests on the per-prompt success rates. These additions will make the 6–8 % and 32 % figures verifiable. revision: yes
Referee: [§3 and §4] §3 (Method) and §4: The approach relies on the assumption that cross-attention maps plus the Predicted Final Image at an intermediate timestep can reliably predict final object presence, and that a restart with a new seed will succeed with low expected attempts. No ablation or analysis quantifies the precision/recall of this early decision or the average number of restarts needed, leaving the claimed time savings and success-rate gains unverified.

Authors: We acknowledge that the manuscript currently lacks explicit quantification of the detector’s reliability and the restart dynamics. In the revision we will add a dedicated ablation subsection that reports precision, recall, and F1 of the early-detection classifier at multiple timesteps (t = 200, 400, 600) on the InsideGen test set, using the final generated image as ground truth. We will also include an empirical distribution of the number of restarts required per prompt, together with a simple expected-time model that combines detection accuracy with the observed success probability of a fresh seed. These results will directly support or qualify the claimed time savings. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper presents an empirical ML detector (HEaD+) trained on a newly introduced InsideGen dataset of 45k images. Its central claims are measured via external success criteria (final image matching all prompt objects, generation time) rather than any internal fitted parameter being renamed as a prediction. No equations, self-definitional loops, or load-bearing self-citations appear in the provided text; the approach is a standard supervised classifier whose performance is validated against held-out generation outcomes. This is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Central claim rests on the domain assumption that intermediate diffusion signals predict final object presence; no free parameters or invented entities are explicitly quantified in the abstract.

axioms (1)

domain assumption Cross-attention maps and a predicted final image at intermediate timesteps contain sufficient information to forecast object presence in the completed generation.
Invoked to justify early stopping and restart decisions.

pith-pipeline@v0.9.0 · 5600 in / 1188 out tokens · 86833 ms · 2026-05-10T00:52:44.205000+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

83 extracted references · 3 canonical work pages

[1]

Campbell, S. L. and Gear, C. W. The index of general nonlinear D A E S. Numer. M ath. 1995

1995
[2]

Slifka, M. K. and Whitton, J. L. Clinical implications of dysregulated cytokine production. J. M ol. M ed. 2000. doi:10.1007/s001090000086

work page doi:10.1007/s001090000086 2000
[3]

Quasimonotonicity, regularity and duality for nonlinear systems of partial differential equations

Hamburger, C. Quasimonotonicity, regularity and duality for nonlinear systems of partial differential equations. Ann. Mat. Pura. Appl. 1995

1995
[4]

Geddes, K. O. and Czapor, S. R. and Labahn, G. Algorithms for C omputer A lgebra. 1992

1992
[5]

Software engineering---from auxiliary to key technologies

Broy, M. Software engineering---from auxiliary to key technologies. Software Pioneers. 1992

1992
[6]

Conductive P olymers. 1981

1981
[7]

Smith, S. E. Neuromuscular blocking drugs in man. Neuromuscular junction. H andbook of experimental pharmacology. 1976

1976
[8]

Chung, S. T. and Morris, R. L. Isolation and characterization of plasmid deoxyribonucleic acid from Streptomyces fradiae. 1978

1978
[9]

and AghaKouchak, A

Hao, Z. and AghaKouchak, A. and Nakhjiri, N. and Farahmand, A. Global integrated drought monitoring and prediction system (GIDMaPS) data sets. 2014

2014
[10]

Babichev, S. A. and Ries, J. and Lvovsky, A. I. Quantum scissors: teleportation of single-mode optical states by means of a nonlocal single photon. 2002

2002
[11]

Wormholes in Maximal Supergravity

Beneke, M. and Buchalla, G. and Dunietz, I. Mixing induced CP asymmetries in inclusive B decays. Phys. L ett. 1997. arXiv:0707.3168

work page Pith review arXiv 1997
[12]

deep SIP : deep learning of S upernova I a P arameters

Stahl, B. deep SIP : deep learning of S upernova I a P arameters. 2020. ascl:2006.023

2020
[13]

Abbott, T. M. C. and others. Dark Energy Survey Year 1 Results: Constraints on Extended Cosmological Models from Galaxy Clustering and Weak Lensing. Phys. Rev. D. 2019. doi:10.1103/PhysRevD.99.123505. arXiv:1810.02499

work page doi:10.1103/physrevd.99.123505 2019
[14]

Goodfellow, Ian and Pouget-Abadie, Jean and Mirza, Mehdi and Xu, Bing and Warde-Farley, David and Ozair, Sherjil and Courville, Aaron and Bengio, Yoshua , journal = nips, booktitle =
[15]

Zhang, Han and Xu, Tao and Li, Hongsheng and Zhang, Shaoting and Wang, Xiaogang and Huang, Xiaolei and Metaxas, Dimitris N , booktitle=ICCV, year=
[16]

Sauer, Axel and Karras, Tero and Laine, Samuli and Geiger, Andreas and Aila, Timo , booktitle=
[17]

Tao, Ming and Bao, Bing-Kun and Tang, Hao and Xu, Changsheng , booktitle=
[18]

Sohl-Dickstein, Jascha and Weiss, Eric and Maheswaranathan, Niru and Ganguli, Surya , booktitle=icml, year=
[19]

Kingma, Durk P and Salimans, Tim and Jozefowicz, Rafal and Chen, Xi and Sutskever, Ilya and Welling, Max , booktitle=nips, year=
[20]

Ho, Jonathan and Jain, Ajay and Abbeel, Pieter , booktitle=nips, year=
[21]

Dhariwal, Prafulla and Nichol, Alexander , booktitle=nips, year=
[22]

Ramesh, Aditya and Dhariwal, Prafulla and Nichol, Alex and Chu, Casey and Chen, Mark , booktitle=
[23]

Yogesh Balaji and Seungjun Nah and Xun Huang and Arash Vahdat and Jiaming Song and Qinsheng Zhang and Karsten Kreis and Miika Aittala and Timo Aila and Samuli Laine and Bryan Catanzaro and Tero Karras and Ming-Yu Liu , booktitle=
[24]

Saharia, Chitwan and Chan, William and Saxena, Saurabh and Li, Lala and Whang, Jay and Denton, Emily and Ghasemipour, Seyed Kamyar Seyed and Ayan, Burcu Karagol and Mahdavi, S Sara and Lopes, Rapha Gontijo and Salimans, Tim and Ho, Jonathan and Fleet, David J and Norouzi, Mohammad , booktitle=nips, year=
[25]

Rombach, Robin and Blattmann, Andreas and Lorenz, Dominik and Esser, Patrick and Ommer, Bj
[26]

Otani, Mayu and Togashi, Riku and Sawai, Yu and Ishigami, Ryosuke and Nakashima, Yuta and Rahtu, Esa and Heikkil
[27]

Heusel, Martin and Ramsauer, Hubert and Unterthiner, Thomas and Nessler, Bernhard and Hochreiter, Sepp , booktitle=nips, year=
[28]

Salimans, Tim and Goodfellow, Ian and Zaremba, Wojciech and Cheung, Vicki and Radford, Alec and Chen, Xi , booktitle=nips, year=
[29]

Hessel, Jack and Holtzman, Ari and Forbes, Maxwell and Bras, Ronan Le and Choi, Yejin , booktitle=emnlp, year=
[30]

Lu, Yujie and Yang, Xianjun and Li, Xiujun and Wang, Xin Eric and Wang, William Yang , booktitle=nips, year=
[31]

Brooks, Tim and Holynski, Aleksander and Efros, Alexei A , booktitle=cvpr, year=
[32]

Kawar, Bahjat and Zada, Shiran and Lang, Oran and Tov, Omer and Chang, Huiwen and Dekel, Tali and Mosseri, Inbar and Irani, Michal , booktitle=cvpr, year=
[33]

Gal, Rinon and Patashnik, Or and Maron, Haggai and Bermano, Amit H and Chechik, Gal and Cohen-Or, Daniel , booktitle=
[34]

Nichol, Alexander Quinn and Dhariwal, Prafulla and Ramesh, Aditya and Shyam, Pranav and Mishkin, Pamela and Mcgrew, Bob and Sutskever, Ilya and Chen, Mark , booktitle = icml, year =
[35]

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation , author=
[36]

Zhang, Shu and Yang, Xinyi and Feng, Yihao and Qin, Can and Chen, Chia-Chih and Yu, Ning and Chen, Zeyuan and Wang, Huan and Savarese, Silvio and Ermon, Stefano and Xiong, Caiming and Xu, Ran , booktitle=
[37]

Agarwal, Aishwarya and Karanam, Srikrishna and Joseph, KJ and Saxena, Apoorv and Goswami, Koustava and Srinivasan, Balaji Vasan , booktitle=ICCV, year=
[38]

Yao, Shunyu and Yu, Dian and Zhao, Jeffrey and Shafran, Izhak and Griffiths, Tom and Cao, Yuan and Narasimhan, Karthik , booktitle=nips, year=. Tree of
[39]

Li, Yumeng and Keuper, Margret and Zhang, Dan and Khoreva, Anna , booktitle=bmvc, year=
[40]

Junnan Li and Dongxu Li and Silvio Savarese and Steven Hoi , year=
[41]

Lin, Tsung-Yi and Maire, Michael and Belongie, Serge and Hays, James and Perona, Pietro and Ramanan, Deva and Doll
[42]

Cho, Jaemin and Zala, Abhay and Bansal, Mohit , booktitle=ICCV, year=
[43]

Feng, Weixi and He, Xuehai and Fu, Tsu-Jui and Jampani, Varun and Akula, Arjun and Narayana, Pradyumna and Basu, Sugato and Wang, Xin Eric and Wang, William Yang , booktitle=
[44]

Ouyang, Long and Wu, Jeffrey and Jiang, Xu and Almeida, Diogo and Wainwright, Carroll and Mishkin, Pamela and Zhang, Chong and Agarwal, Sandhini and Slama, Katarina and Ray, Alex and others , booktitle=nips, year=
[45]

Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser,
[46]

Bakr, Eslam Mohamed and Sun, Pengzhan and Shen, Xiaogian and Khan, Faizan Farooq and Li, Li Erran and Elhoseiny, Mohamed , booktitle=ICCV, year=
[47]

Diffusion

Dhariwal, Prafulla and Nichol, Alexander , booktitle=nips, year=. Diffusion
[48]

Zero-shot

Parmar, Gaurav and Kumar Singh, Krishna and Zhang, Richard and Li, Yijun and Lu, Jingwan and Zhu, Jun-Yan , booktitle=. Zero-shot
[49]

Wang, Zirui and Sha, Zhizhou and Ding, Zheng and Wang, Yilin and Tu, Zhuowen , booktitle=cvpr, year=
[50]

Achiam, Josh and Adler, Steven and Agarwal, Sandhini and Ahmad, Lama and Akkaya, Ilge and Aleman, Florencia Leoni and Almeida, Diogo and Altenschmidt, Janko and Altman, Sam and Anadkat, Shyamal and others , booktitle=
[51]

Chefer, Hila and Alaluf, Yuval and Vinker, Yael and Wolf, Lior and Cohen-Or, Daniel , booktitle=
[52]

Weixi Feng and Xuehai He and Tsu-Jui Fu and Varun Jampani and Arjun Reddy Akula and Pradyumna Narayana and Sugato Basu and Xin Eric Wang and William Yang Wang , booktitle=iclr, year=
[53]

Chen, Minghao and Laina, Iro and Vedaldi, Andrea , booktitle=wacv, year=
[54]

Liu, Nan and Li, Shuang and Du, Yilun and Torralba, Antonio and Tenenbaum, Joshua B , booktitle=eccv, year=
[55]

Betti, Federico and Staiano, Jacopo and Baraldi, Lorenzo and Baraldi, Lorenzo and Cucchiara, Rita and Sebe, Nicu , booktitle=acmmm, year=
[56]

Tenenbaum and Kfir Aberman and Yael Pritch and Daniel Cohen-Or , booktitle=

Amir Hertz and Ron Mokady and Jay M. Tenenbaum and Kfir Aberman and Yael Pritch and Daniel Cohen-Or , booktitle=
[57]

Wang, Weilun and Bao, Jianmin and Zhou, Wengang and Chen, Dongdong and Chen, Dong and Yuan, Lu and Li, Houqiang , booktitle=
[58]

Alec Radford and Jong Wook Kim and Chris Hallacy and Aditya Ramesh and Gabriel Goh and Sandhini Agarwal and Girish Sastry and Amanda Askell and Pamela Mishkin and Jack Clark and Gretchen Krueger and Ilya Sutskever , booktitle=icml, year=
[59]

Helbling, Alec and Montoya, Evan and Chau, Duen Horng , booktitle=
[60]

Reed and Zeynep Akata and Xinchen Yan and Lajanugen Logeswaran and Bernt Schiele and Honglak Lee , booktitle=icml, year=

Scott E. Reed and Zeynep Akata and Xinchen Yan and Lajanugen Logeswaran and Bernt Schiele and Honglak Lee , booktitle=icml, year=
[61]

Generating images of rare concepts using pre-trained diffusion models , author=
[62]

Karthik, Shyamgopal and Roth, Karsten and Mancini, Massimiliano and Akata, Zeynep , booktitle=
[63]

Tenenbaum , booktitle=eccv, year=

Nan Liu and Shuang Li and Yilun Du and Antonio Torralba and Joshua B. Tenenbaum , booktitle=eccv, year=
[64]

Wu, Qiucheng and Liu, Yujian and Zhao, Handong and Bui, Trung and Lin, Zhe and Zhang, Yang and Chang, Shiyu , booktitle=ICCV, year=
[65]

Podell, Dustin and English, Zion and Lacey, Kyle and Blattmann, Andreas and Dockhorn, Tim and M
[66]

Mao, Jiafeng and Wang, Xueting and Aizawa, Kiyoharu , title =
[67]

Royi Rassin and Eran Hirsch and Daniel Glickman and Shauli Ravfogel and Yoav Goldberg and Gal Chechik , booktitle=nips, year=
[68]

Zhang, Yifan and Kang, Bingyi and Hooi, Bryan and Yan, Shuicheng and Feng, Jiashi , booktitle=ieeetpami, year=
[69]

Bakr, Eslam Mohamed and Sun, Pengzhan and Shen, Xiaoqian and Khan, Faizan Farooq and Li, Li Erran and Elhoseiny, Mohamed , title =
[70]

Samuel, Dvir and Ben-Ari, Rami and Darshan, Nir and Maron, Haggai and Chechik, Gal , booktitle=nips, year=
[71]

Hu, Yushi and Liu, Benlin and Kasai, Jungo and Wang, Yizhong and Ostendorf, Mari and Krishna, Ranjay and Smith, Noah A , booktitle=iccv, year=
[72]

Minderer, Matthias and Gritsenko, Alexey and Houlsby, Neil , booktitle=nips, year=
[73]

Luping Liu and Yi Ren and Zhijie Lin and Zhou Zhao , booktitle=iclr, year=
[74]

He, Kaiming and Zhang, Xiangyu and Ren, Shaoqing and Sun, Jian , booktitle=cvpr, year=
[75]

Dosovitskiy, Alexey and Beyer, Lucas and Kolesnikov, Alexander and Weissenborn, Dirk and Zhai, Xiaohua and Unterthiner, Thomas and Dehghani, Mostafa and Minderer, Matthias and Heigold, Georg and Gelly, Sylvain and others , booktitle=iclr, year=
[76]

Ronneberger, Olaf and Fischer, Philipp and Brox, Thomas , booktitle=
[77]

and Welling, Max , booktitle = iclr, title =

Kingma, Diederik P. and Welling, Max , booktitle = iclr, title =
[78]

Viescore: Towards explainable metrics for conditional image synthesis evaluation , author=
[79]

Singh, Jaskirat and Zheng, Liang , booktitle=nips, year=
[80]

Evaluating Text-to-Visual Generation with Image-to-Text Generation , author=

Showing first 80 references.