pith. machine review for the scientific record. sign in

arxiv: 2604.26348 · v1 · submitted 2026-04-29 · 💻 cs.CV · cs.AI

Recognition: unknown

ACPO: Anchor-Constrained Perceptual Optimization for Diffusion Models with No-Reference Quality Guidance

Authors on Pith no claims yet

Pith reviewed 2026-05-07 13:59 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords diffusion modelsperceptual optimizationno-reference image quality assessmentanchor regularizationimage generationfine-tuning stability
0
0 comments X

The pith

Anchor regularization lets diffusion models adopt no-reference perceptual guidance without instability or fidelity loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Diffusion models are usually trained with pixel-wise matching to ground-truth images, an objective that can leave subjective visual appeal and semantic alignment short of what people prefer. Adding guidance from no-reference image quality assessment models directly creates a mismatch with the diffusion training goal, which produces unstable fine-tuning and unwanted shifts in the output distribution. The paper shows that an anchor term enforcing the same noise predictions as the original base model resolves enough of that mismatch to let perceptual signals steer the model toward higher-quality outputs. The resulting adaptation improves appearance and consistency while keeping the original generative behavior, diversity, and training stability intact.

Core claim

The authors establish that an anchor-constrained optimization framework, which pairs a learned no-reference image quality assessment signal with regularization that matches the fine-tuned model's noise predictions to those of the unfine-tuned base model, produces stable perceptual adaptation in diffusion models.

What carries the argument

Anchor-based regularization on noise prediction, which keeps the adapted model consistent with the base diffusion model while the no-reference quality assessment model supplies the perceptual guidance signal.

If this is right

  • Perceptual quality of generated images rises while generation diversity is retained.
  • Training stability holds and no large distributional drift occurs.
  • The adapted model remains consistent with the original generative behavior.
  • Controlled movement toward perceptually preferred outputs becomes possible without retraining from scratch.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same anchoring idea could be tested on other guidance signals, such as explicit semantic-consistency losses, to see whether stability benefits generalize.
  • Applying the method to video or 3D diffusion models would test whether noise-prediction consistency also preserves temporal or geometric coherence.
  • If the anchor proves robust across different no-reference quality models, it suggests many perception-based objectives could be added to diffusion training under similar constraints.

Load-bearing premise

The anchor regularization term is strong enough to eliminate the mismatch between perceptual optimization and the original diffusion objective.

What would settle it

A controlled fine-tuning experiment in which the anchor term is removed and no increase in instability or distributional drift appears, or in which the anchored version shows no gain in perceptual quality metrics or human preference, would disprove the central claim.

Figures

Figures reproduced from arXiv: 2604.26348 by Feifan Meng, Han Fang, Weiming Zhang, Yang Yang.

Figure 1
Figure 1. Figure 1: Teaser. Comparison between the baseline model view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the Anchor-Constrained Perceptual Optimization (ACPO) framework. The framework integrates diffusion view at source ↗
Figure 3
Figure 3. Figure 3: Qualitative comparison between the baseline DDPM and the proposed quality-aware diffusion model across multiple view at source ↗
Figure 4
Figure 4. Figure 4: The visual results of text-to-image generation between the baseline Stable Diffusion and our improved model. For view at source ↗
Figure 5
Figure 5. Figure 5: Extended qualitative comparison between the baseline DDPM and our ACPO model across multiple datasets: CIFAR-10 view at source ↗
Figure 6
Figure 6. Figure 6: Visual demonstration of improved semantic alignment on Stable Diffusion. Specific text prompts are indicated by the view at source ↗
Figure 7
Figure 7. Figure 7: Qualitative results on the DrawBench benchmark. For each comparison, the baseline Stable Diffusion result is shown view at source ↗
Figure 8
Figure 8. Figure 8: Generalization results on the PartiPrompts benchmark across various common object categories. Each pair consists view at source ↗
Figure 9
Figure 9. Figure 9: Generalization results on the PartiPrompts benchmark across various common object categories. Each pair consists view at source ↗
read the original abstract

Diffusion models have achieved remarkable success in image generation, yet their training is predominantly driven by full-reference objectives that enforce pixel-wise similarity to ground-truth images.Such supervision, while effective for fidelity, may insufficient in terms of subjective visual perception quality and text-image semantic consistency. In this work, we investigate the problem of incorporating no-reference perceptual quality into diffusion training. A key challenge is that directly optimizing perceptual signals, such as those provided by no-reference image quality assessment (NR-IQA) models, introduces a mismatch with the original diffusion objective, leading to training instability and distributional drift during fine-tuning. To address this issue, we propose an anchor-constrained optimization framework that enables stable perceptual adaptation. Specifically, we leverage a learned NR-IQA model as a perceptual guidance signal, while introducing an anchor-based regularization that enforces consistency with the base diffusion model in terms of noise prediction. This design effectively balances perceptual quality improvement and generative fidelity, allowing controlled adaptation toward perceptually favorable outputs without compromising the original generative behavior. Extensive experiments demonstrate that our method consistently enhances perceptual quality while preserving generation diversity and training stability, highlighting the effectiveness of anchor-constrained perceptual optimization for diffusion models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes an anchor-constrained perceptual optimization (ACPO) framework for fine-tuning diffusion models. It uses a learned no-reference image quality assessment (NR-IQA) model to provide perceptual guidance while adding an anchor-based regularization term that enforces consistency between the fine-tuned model's noise predictions and those of the frozen base model. The combined objective is claimed to enable stable adaptation that improves perceptual quality without distributional drift or loss of generative fidelity, with extensive experiments demonstrating consistent gains in perceptual metrics while preserving diversity and stability.

Significance. If the central stability claim holds with proper supporting analysis, the work would offer a practical and generalizable approach to incorporating perceptual signals into diffusion training, addressing a recognized mismatch between pixel-wise diffusion objectives and subjective quality. The anchor regularization idea is a clear strength that could extend to other generative fine-tuning settings. However, the absence of theoretical grounding or isolating ablations reduces the immediate significance relative to prior perceptual guidance methods in diffusion literature.

major comments (2)
  1. [Method section] Method section (combined objective L = L_diff + λ L_anchor + μ L_NR-IQA): the claim that the anchor term fully mitigates distributional drift and training instability induced by direct NR-IQA gradients lacks any supporting analysis, such as bounds on the NR-IQA model's Lipschitz constant or perturbation analysis on the score function. This is load-bearing for the central claim of stable perceptual adaptation.
  2. [Experiments section] Experiments section: improved perceptual scores with stable FID are reported, yet no ablation varies the NR-IQA weighting μ while holding the anchor term fixed (or vice versa). Without such isolation, it remains possible that observed stability results from weak perceptual gradients rather than robust mitigation by the anchor, undermining the cross-setting claims.
minor comments (1)
  1. [Abstract] Abstract: the statement that the method 'consistently enhances perceptual quality while preserving generation diversity' would be strengthened by naming the specific NR-IQA model, datasets, and key metrics (e.g., FID, CLIP score) used in the experiments.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential of the anchor regularization approach. We respond to each major comment below with clarifications based on the manuscript's empirical focus and propose targeted revisions to address the concerns.

read point-by-point responses
  1. Referee: [Method section] Method section (combined objective L = L_diff + λ L_anchor + μ L_NR-IQA): the claim that the anchor term fully mitigates distributional drift and training instability induced by direct NR-IQA gradients lacks any supporting analysis, such as bounds on the NR-IQA model's Lipschitz constant or perturbation analysis on the score function. This is load-bearing for the central claim of stable perceptual adaptation.

    Authors: We agree that theoretical supporting analysis such as Lipschitz bounds on the NR-IQA model or perturbation analysis on the score function would add rigor. Deriving such bounds is challenging because the NR-IQA model is a pre-trained black-box network. Our manuscript instead grounds the stability claim in extensive empirical results across multiple models and datasets, where the anchor term maintains stable FID, diversity, and training dynamics even as perceptual quality improves. The anchor explicitly regularizes noise predictions toward the frozen base model, which we show prevents the drift seen in direct NR-IQA optimization. We will revise the method section to expand the discussion of this empirical evidence and the design intuition behind the anchor without overstating theoretical guarantees. revision: partial

  2. Referee: [Experiments section] Experiments section: improved perceptual scores with stable FID are reported, yet no ablation varies the NR-IQA weighting μ while holding the anchor term fixed (or vice versa). Without such isolation, it remains possible that observed stability results from weak perceptual gradients rather than robust mitigation by the anchor, undermining the cross-setting claims.

    Authors: We acknowledge the value of these isolating ablations. While our experiments vary the overall objective balance and report consistent stability, we did not explicitly fix λ and sweep μ (or the reverse). We will add these targeted ablations to the revised experiments section, demonstrating that the anchor enables stable use of higher μ values and that stability degrades without it. This will directly support the cross-setting claims. revision: yes

Circularity Check

0 steps flagged

No significant circularity; proposed composite objective is a design choice with independent content

full rationale

The paper defines a combined training objective L = L_diff + λ L_anchor + μ L_NR-IQA that adds an anchor regularization term to the standard diffusion loss and an NR-IQA perceptual term. This is presented as an engineering solution to balance objectives rather than a derivation that reduces any claimed prediction or result to a fitted parameter or self-referential quantity by construction. No equations are shown that make the anchor term equivalent to the input data or that rename a known result. No load-bearing self-citations or uniqueness theorems imported from prior author work appear in the provided text. The framework builds on external NR-IQA models and frozen base diffusion checkpoints, keeping the central claim self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Review performed on abstract only; ledger entries reflect components explicitly invoked in the abstract description.

axioms (1)
  • domain assumption Learned NR-IQA models supply reliable perceptual guidance signals suitable for guiding diffusion fine-tuning.
    The method treats the NR-IQA output as a usable training signal without further validation in the abstract.
invented entities (1)
  • Anchor-based regularization term no independent evidence
    purpose: Enforce consistency between the fine-tuned model's noise predictions and those of the base diffusion model to prevent instability and drift.
    Introduced specifically to balance perceptual adaptation against the original generative objective.

pith-pipeline@v0.9.0 · 5511 in / 1385 out tokens · 77921 ms · 2026-05-07T13:59:12.048250+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

46 extracted references · 42 canonical work pages · 5 internal anchors

  1. [1]

    Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhong- dao Wang, Ping Luo, Huchuan Lu, and Zhenguo Li. 2024. PixArt- Σ: Weak-to- Strong Training of Diffusion Transformer for 4K Text-to-Image Generation.arXiv Preprint(2024). arXiv:2403.04692 [cs.CV] doi:10.48550/arXiv.2403.04692

  2. [2]

    Simoncelli

    Keyan Ding, Kede Ma, Shiqi Wang, and Eero P. Simoncelli. 2022. Image Quality Assessment: Unifying Structure and Texture Similarity.IEEE Transactions on Pattern Analysis and Machine Intelligence44, 5 (2022), 2567–2581. doi:10.1109/ TPAMI.2020.3045810

  3. [3]

    Zhen Han, Zeyinzi Jiang, Yulin Pan, Jingfeng Zhang, Chaojie Mao, Chenwei Xie, Yu Liu, and Jingren Zhou. 2024. ACE: All-round Creator and Editor Following Instructions via Diffusion Transformer. arXiv:2410.00086 [cs.CV] doi:10.48550/ arXiv.2410.00086

  4. [4]

    Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. 2021. CLIPScore: A Reference-free Evaluation Metric for Image Captioning.arXiv abs/2104.08718 (2021). https://arxiv.org/abs/2104.08718

  5. [5]

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising Diffusion Probabilistic Models. InAdvances in Neural Information Processing Systems (NeurIPS ’20). Cur- ran Associates Inc., Red Hook, NY, USA, 6840–6851. doi:10.5555/3495724.3496298

  6. [6]

    Fleet, Mohammad Norouzi, and Tim Salimans

    Jonathan Ho, Chitwan Saharia, William Chan, David J. Fleet, Mohammad Norouzi, and Tim Salimans. 2022. Cascaded Diffusion Models for High Fidelity Image Generation.Journal of Machine Learning Research23, 47 (2022), 1–33. https: //www.jmlr.org/papers/v23/21-0635.html

  7. [7]

    Auto-Encoding Variational Bayes

    Diederik P. Kingma and Max Welling. 2014. Auto-Encoding Variational Bayes. In International Conference on Learning Representations. arXiv:1312.6114 [stat.ML] doi:10.48550/arXiv.1312.6114

  8. [8]

    Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. 2023. Pick-a-Pic: An Open Dataset of User Preferences for Text-to- Image Generation. InAdvances in Neural Information Processing Systems, Vol. 36. arXiv:2305.01569 [cs.CV] doi:10.48550/arXiv.2305.01569

  9. [9]

    Moez Krichen. 2023. Generative Adversarial Networks. In2023 14th International Conference on Computing Communication and Networking Technologies (ICCCNT). 1–7. doi:10.1109/ICCCNT56998.2023.10306417

  10. [10]

    International Journal of Computer Vision (IJCV)123(1), 32–73 (2017),https: //link.springer.com/article/10.1007/S11263-016-0981-7

    Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Li Fei-Fei. 2017. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations.International Journal of Computer Vision123, 1 (2017), 32–73. doi:10.1007/s11...

  11. [11]

    2009.Learning Multiple Layers of Features from Tiny Images

    Alex Krizhevsky. 2009.Learning Multiple Layers of Features from Tiny Images. Technical Report. University of Toronto. https://www.cs.toronto.edu/~kriz/ learning-features-2009-TR.pdf

  12. [12]

    Chunyi Li, Zicheng Zhang, Haoning Wu, Wei Sun, Xiongkuo Min, Xiaohong Liu, Guangtao Zhai, and Weisi Lin. 2023. AGIQA-1K: A Perceptual Quality Assessment Exploration for AIGC Images.arXiv(2023). doi:10.48550/arXiv.2303.12618

  13. [13]

    Chunyi Li, Zicheng Zhang, Haoning Wu, Wei Sun, Xiongkuo Min, Xiaohong Liu, Guangtao Zhai, and Weisi Lin. 2024. AGIQA-3K: An Open Database for AI- Generated Image Quality Assessment.IEEE Transactions on Circuits and Systems for Video Technology34, 9 (2024), 6833–6846. doi:10.1109/TCSVT.2023.3319020

  14. [14]

    Chunyi Li, Zicheng Zhang, Haoning Wu, Wei Sun, Xiongkuo Min, Xiaohong Liu, Guangtao Zhai, and Weisi Lin. 2024. AIGIQA-20K: A Large Database for AI-Generated Image Quality Assessment. InIEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). 6327–6336. doi:10.1109/ CVPRW63382.2024.006327

  15. [15]

    Qiang Li, Qingsen Yan, Haojian Huang, Peng Wu, Haokui Zhang, and Yanning Zhang. 2025. Text-Visual Semantic Constrained AI-Generated Image Quality As- sessment. InProceedings of the 33rd ACM International Conference on Multimedia (MM ’25). ACM, New York, NY, USA, 6958–6966. doi:10.1145/3746027.3755471

  16. [16]

    Zhimin Li, Jianwei Zhang, Qin Lin, Jiangfeng Xiong, Yanxin Long, Xinchi Deng, Yingfang Zhang, Xingchao Liu, Minbin Huang, Zedong Xiao, Dayou Chen, Jiajun He, Jiahao Li, Wenyue Li, Chen Zhang, Rongwei Quan, Jianxiang Lu, Jiabin Huang, Xiaoyan Yuan, Xiaoxiao Zheng, Yixuan Li, Jihong Zhang, Chao Zhang, Meng Chen, Jie Liu, Zheng Fang, Weiyan Wang, Jinbao Xue,...

  17. [17]

    Jupo Ma, Jinjian Wu, Leida Li, Weisheng Dong, and Xuemei Xie. 2020. Active Inference of GAN for No-Reference Image Quality Assessment. In2020 IEEE Inter- national Conference on Multimedia and Expo (ICME). 1–6. doi:10.1109/ICME46284. 2020.9102895

  18. [18]

    Anish Mittal, Anush Krishna Moorthy, and Alan Conrad Bovik. 2012. No- Reference Image Quality Assessment in the Spatial Domain.IEEE Trans. Image Process.21, 12 (2012), 4695–4708. doi:10.1109/TIP.2012.2214050

  19. [19]

    Com- pletely Blind

    Anish Mittal, Rajiv Soundararajan, and Alan C. Bovik. 2013. Making a “Com- pletely Blind” Image Quality Analyzer.IEEE Signal Process. Lett.20, 3 (2013), 209–212. doi:10.1109/LSP.2012.2227726

  20. [20]

    Fei Peng, Huiyuan Fu, Anlong Ming, Chuanming Wang, Huadong Ma, Shuai He, Zifei Dou, and Shu Chen. 2024. AIGC Image Quality Assessment via Image- Prompt Correspondence. InIEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 6432–6441. doi:10.1109/CVPRW.2024.00653

  21. [21]

    Alec Radford, Luke Metz, and Soumith Chintala. 2016. Unsupervised Represen- tation Learning with Deep Convolutional Generative Adversarial Networks. In International Conference on Learning Representations. arXiv:1511.06434 [cs.LG] doi:10.48550/arXiv.1511.06434

  22. [22]

    Soumik Rakshit. 2018. Anime Faces Dataset. https://www.kaggle.com/ soumikrakshit/anime-faces Kaggle Dataset

  23. [23]

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. 2022. Hierarchical Text-Conditional Image Generation with CLIP Latents. InAdvances in Neural Information Processing Systems, Vol. 35. arXiv:2204.06125 [cs.CV] doi:10.48550/arXiv.2204.06125

  24. [24]

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-Resolution Image Synthesis with Latent Diffusion Models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10684–10695. doi:10.1109/CVPR52688.2022.01061

  25. [25]

    Sara Mah- davi, Raphael Gontijo-Lopes, Tim Salimans, Jonathan Ho, David J

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mah- davi, Raphael Gontijo-Lopes, Tim Salimans, Jonathan Ho, David J. Fleet, and Mohammad Norouzi. 2022. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. InAdvances in Neural Informatio...

  26. [26]

    Jiaming Song, Chenlin Meng, and Stefano Ermon. 2021. Denoising Diffu- sion Implicit Models. InInternational Conference on Learning Representations. arXiv:2010.02502 [cs.LG] doi:10.48550/arXiv.2010.02502

  27. [28]

    Score-Based Generative Modeling through Stochastic Differential Equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. 2021. Score-Based Generative Modeling through Stochas- tic Differential Equations. InAdvances in Neural Information Processing Systems, Vol. 34. arXiv:2011.13456 [cs.LG] doi:10.48550/arXiv.2011.13456

  28. [29]

    Shaolin Su, Qingsen Yan, Yu Zhu, Cheng Zhang, Xin Ge, Jinqiu Sun, and Yanning Zhang. 2020. Blindly Assess Image Quality in the Wild Guided by a Self-Adaptive Hyper Network. InIEEE/CVF Conference on Computer Vision and Pattern Recogni- tion. 3667–3676. doi:10.1109/CVPR42600.2020.00374 Yang et al. ACPO: Anchor-Constrained Perceptual Optimization

  29. [30]

    Yu Tian, Yue Liu, Shiqi Wang, and Sam Kwong. 2025. Quality Assessment for Text-to-Image Generation: A Survey.IEEE MultiMedia32, 2 (2025), 44–52. doi:10.1109/MMUL.2025.3538862

  30. [31]

    Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. 2016. Pixel Recurrent Neural Networks. InInternational Conference on Machine Learning. 1747–1756. arXiv:1601.06759 [cs.CV] doi:10.48550/arXiv.1601.06759

  31. [32]

    Jiarui Wang, Huiyu Duan, Jing Liu, Shi Chen, Xiongkuo Min, and Guangtao Zhai. 2023. AIGCIQA2023: A Large-scale Image Quality Assessment Database for AI Generated Images: from the Perspectives of Quality, Authenticity and Correspondence.arXiv(2023). doi:10.48550/arXiv.2307.00211

  32. [33]

    IEEE Trans

    Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. 2004. Image Quality Assessment: From Error Visibility to Structural Similarity.IEEE Trans. Image Process.13, 4 (2004), 600–612. doi:10.1109/TIP.2003.819861

  33. [34]

    Wang, Evan Montoya, David Munechika, Haoyang Yang, Benjamin Hoover, and Duen Horng Chau

    Zijie J. Wang, Evan Montoya, David Munechika, Haoyang Yang, Benjamin Hoover, and Duen Horng Chau. 2023. DiffusionDB: A Large-scale Prompt Gallery Dataset for Text-to-Image Generative Models.arXiv2210.14896 (2023). https://arxiv.org/ abs/2210.14896

  34. [35]

    Wufeng Xue, Lei Zhang, Xuanqin Mou, and Alan C. Bovik. 2014. Gradient Magnitude Similarity Deviation: A Highly Efficient Perceptual Image Quality Index.IEEE Trans. Image Process.23, 2 (2014), 684–695. doi:10.1109/TIP.2013. 2293423

  35. [36]

    Qingsen Yan, Dong Gong, and Yanning Zhang. 2019. Two-Stream Convolutional Networks for Blind Image Quality Assessment.IEEE Trans. Image Process.28, 5 (2019), 2200–2211. doi:10.1109/TIP.2018.2883741

  36. [37]

    Haibo Yang, Yang Chen, Yingwei Pan, Ting Yao, Zhineng Chen, Chong-Wah Ngo, and Tao Mei. 2024. Hi3D: Pursuing High-Resolution Image-to-3D Generation with Video Diffusion Models. InProceedings of the 32nd ACM International Conference on Multimedia (MM ’24). ACM, New York, NY, USA, 6870–6879. doi:10.1145/ 3664647.3681634

  37. [38]

    Sidi Yang, Tianhe Wu, Shuwei Shi, Shanshan Lao, Yuan Gong, Mingdeng Cao, Jiahao Wang, and Yujiu Yang. 2022. MANIQA: Multi-Dimension At- tention Network for No-Reference Image Quality Assessment. InIEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 1191–1200. doi:10.1109/CVPRW56347.2022.00126

  38. [39]

    Junyong You and Jari Korhonen. 2021. Transformer For Image Quality Assess- ment. In2021 IEEE International Conference on Image Processing (ICIP). 1389–1393. doi:10.1109/ICIP42928.2021.9506169

  39. [40]

    Fisher Yu, Yinda Zhang, Shuran Song, Ari Seff, and Jianxiong Xiao. 2015. Con- struction of a Large-scale Image Dataset using Deep Learning with Humans in the Loop. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). https://arxiv.org/abs/1506.03365

  40. [41]

    Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, and Yonghui Wu. 2022. Scaling Autoregressive Models for Content-Rich Text- to-Image Generation.Transactions on Machine Learning Research2022 ...

  41. [42]

    Zihao Yu, Fengbin Guan, Yiting Lu, Xin Li, and Zhibo Chen. 2024. SF-IQA: Quality and Similarity Integration for AI Generated Image Quality Assessment. InIEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 6692–6701. doi:10.1109/CVPRW.2024.00679

  42. [43]

    Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. 2023. Adding Conditional Control to Text-to-Image Diffusion Models. InIEEE/CVF International Conference on Computer Vision. 3788–3798. doi:10.1109/ICCV51070.2023.00383

  43. [44]

    Lin Zhang, Ying Shen, and Hongyu Li. 2014. VSI: A Visual Saliency-Induced Index for Perceptual Image Quality Assessment.IEEE Trans. Image Process.23, 10 (2014), 4270–4281. doi:10.1109/TIP.2014.2346028

  44. [45]

    Lin Zhang, Lei Zhang, Xuanqin Mou, and David Zhang. 2011. FSIM: A Feature Similarity Index for Image Quality Assessment.IEEE Trans. Image Process.20, 8 (2011), 2378–2386. doi:10.1109/TIP.2011.2109730

  45. [46]

    Efros, Eli Shechtman, and Oliver Wang

    Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang

  46. [47]

    polishing

    The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. InIEEE/CVF Conference on Computer Vision and Pattern Recognition. 586–595. doi:10.1109/CVPR.2018.58 ACPO: Anchor-Constrained Perceptual Optimization Yang et al. Supplementary Material This supplementary material complements the main paper by pro- viding an expanded set of visual evalu...