
arxiv: 2605.20147 · v1 · pith:2O2HHZADnew · submitted 2026-05-19 · 💻 cs.CV

PixVerve: Advancing Native UHR Image Generation to 100MP with a Large-Scale High-Quality Dataset

Pith reviewed 2026-05-20 05:14 UTC · model grok-4.3

classification 💻 cs.CV
keywords ultra-high-resolution · text-to-image generation · 100MP images · image dataset · training schemes · image quality evaluation · semantic alignment · T2I models

The pith

A dataset of 95,000 ultra-high-resolution images enables text-to-image models to generate at native 100-megapixel resolution through three training schemes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PixVerve-95K, a collection of 95,000 images each containing at least 100 million pixels, gathered from varied scenarios and equipped with seven types of annotations. It shows how this dataset supports the adaptation of multiple existing text-to-image models to produce 100MP images directly, rather than relying on later enlargement steps, by testing three different training approaches. The authors also build PixVerve-Bench, an evaluation set that combines conventional image metrics with assessments from large multimodal models to check both visual fidelity and how well outputs match input text. This addresses the core shortage of suitable high-resolution training material that has limited progress toward detailed, large-scale generated images. A reader would care because successful extension to 100MP would bring AI image creation closer to the detail levels used in professional photography and large-format displays.

Core claim

By curating the PixVerve-95K dataset of 95K images at minimum 100MP resolution with seven-dimensional annotations and applying three training schemes to various T2I foundation models, native 100MP generation is shown to be feasible, as validated by the PixVerve-Bench protocol that measures both visual quality and semantic alignment using standard metrics and multimodal large language model judgments.

What carries the argument

The PixVerve-95K dataset, consisting of 95,000 images each with at least 100 million pixels and seven-dimensional annotations, paired with three training schemes that adapt text-to-image foundation models for direct native 100MP output.
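The admission rule stated above (each image has at least 100 million pixels) can be sketched as a simple dimension check. This is a minimal illustration, not the authors' curation pipeline: the function names and the choice to count pixels as width times height are assumptions made here.

```python
# Illustrative sketch only: names and structure are hypothetical,
# not taken from the PixVerve-95K pipeline.
MIN_PIXELS = 100_000_000  # "at least 100 million pixels" per image


def pixel_count(width: int, height: int) -> int:
    """Return the total number of pixels for the given dimensions."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    return width * height


def meets_100mp(width: int, height: int, min_pixels: int = MIN_PIXELS) -> bool:
    """True if the image reaches the minimum pixel count."""
    return pixel_count(width, height) >= min_pixels


print(meets_100mp(10_000, 10_000))  # True: exactly 100,000,000 pixels
print(meets_100mp(3840, 2160))      # False: a 4K UHD frame has 8,294,400 pixels
```

A square image needs a side of 10,000 pixels to pass, which is roughly twelve times the pixel count of a 4K UHD frame.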

If this is right

  • Existing text-to-image models can reach native 100MP output without depending on separate upsampling stages.
  • The three training schemes supply concrete ways to manage the added complexity of ultra-high-resolution content during adaptation.
  • PixVerve-Bench supplies a repeatable protocol for judging both visual quality and prompt alignment at these resolutions.
  • Experimental comparisons across schemes yield practical guidance on data use and training choices for higher-resolution work.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the dataset generalizes, the same curation approach could scale to create training sets for resolutions beyond 100MP.
  • The results imply that targeted high-quality data collection may matter more than major model redesigns when increasing output resolution.
  • Similar techniques could transfer to related tasks such as high-resolution video generation or domain-specific imagery like medical scans.

Load-bearing premise

The curated PixVerve-95K dataset is assumed to contain sufficiently diverse, high-quality, and correctly annotated ultra-high-resolution content that generalizes beyond the specific collection pipeline used to build it.

What would settle it

Generate 100MP images from the adapted models on text prompts describing scenes or objects poorly represented in the 95K dataset; if the outputs exhibit visible artifacts, loss of coherence, or weaker text alignment than lower-resolution baselines, the central claim would be challenged.

Original abstract

Text-to-Image (T2I) models have recently seen notable progress around 1K and 2K resolution. With the extreme desire for better visual experience and the rapid development of imaging technology, the demand for Ultra-High-Resolution (UHR) image generation has grown significantly. However, UHR image generation poses great challenges due to the scarcity and complexity of high-resolution content. In this paper, we first introduce PixVerve-95K, a high-quality, open-source UHR T2I dataset curated with a carefully designed data pipeline, which contains 95K images across diverse scenarios (each image has a minimum pixel-count of 100M) and seven-dimensional annotations. Based on our large-scale image-text dataset, we take a pioneering step to extend various T2I foundation models to native 100MP generation with three training schemes. Finally, leveraging both conventional metrics and multimodal large language model-based assessments, our proposed PixVerve-Bench benchmark establishes a comprehensive evaluation protocol for UHR images encompassing visual quality and semantic alignment. Extensive experimental results on our benchmark and the constructive exploration of training strategies collaboratively provide valuable insights for future breakthroughs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims to introduce PixVerve-95K, a high-quality open-source UHR T2I dataset with 95K images of at least 100MP each and seven-dimensional annotations, curated via a carefully designed pipeline. It extends various T2I foundation models to native 100MP generation using three training schemes and establishes the PixVerve-Bench benchmark for comprehensive evaluation of UHR images using conventional metrics and MLLM-based assessments. The work provides extensive experimental results and insights for future UHR generation breakthroughs.

Significance. If the results hold, this would be a significant contribution to text-to-image generation: it would enable native ultra-high-resolution output, a capability that remains limited today. The large-scale dataset and benchmark could serve as valuable community resources for further work on high-resolution content. The empirical exploration of training strategies is a strength, provided the strategies prove effective beyond this specific dataset.

major comments (3)
  1. [Abstract] The assertion that the curated PixVerve-95K dataset contains sufficiently diverse, high-quality, and correctly annotated ultra-high-resolution content is central to the paper's claims. However, no quantitative checks on annotation accuracy, inter-annotator agreement, or out-of-pipeline generalization are reported, which is critical for validating that the benchmark gains are due to the training schemes rather than data-specific artifacts.
  2. [Training Schemes] Details on the three training schemes are provided, but the manuscript lacks specific information on how they handle the computational challenges of 100MP images, such as memory efficiency or resolution-specific adaptations, making it difficult to assess the stability of native generation.
  3. [PixVerve-Bench] The benchmark is described as using multimodal large language model-based assessments, but the specific MLLMs employed and the validation of their assessments against human judgments should be detailed to ensure the reliability of the evaluation protocol.
minor comments (2)
  1. [Abstract] Consider replacing 'pioneering step' with a less hyperbolic term to align with standard academic tone.
  2. Verify that all acronyms are defined at first use and that the reference list is complete for prior work on high-resolution image generation.
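Major comment 1 asks for inter-annotator agreement. When more than two annotators label each item, one common statistic is Fleiss' kappa. The sketch below is a plain-Python illustration under assumed inputs: a table where each row is an item, each column a category, and each cell the number of annotators who chose that category. The toy tables are hypothetical and are not data from the paper.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table where counts[i][j] is the number of
    raters who assigned item i to category j. Every item must have the
    same number of raters."""
    n_items = len(counts)
    if n_items == 0:
        raise ValueError("need at least one item")
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("need at least two raters per item")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must have the same number of raters")

    n_categories = len(counts[0])
    total = n_items * n_raters

    # Share of all ratings that fell into each category.
    p_j = [sum(row[j] for row in counts) / total for j in range(n_categories)]

    # Observed agreement per item, then averaged over items.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items

    # Agreement expected by chance.
    p_e = sum(p * p for p in p_j)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when all ratings share one category")
    return (p_bar - p_e) / (1.0 - p_e)


# Hypothetical toy tables: 3 raters, 2 categories.
perfect = [[3, 0], [0, 3]]          # raters always agree
mixed = [[2, 1], [1, 2], [3, 0]]    # agreement at chance level
print(fleiss_kappa(perfect))
print(round(fleiss_kappa(mixed), 6))
```

Values near 1 indicate strong agreement and values near 0 indicate agreement no better than chance, so a reported kappa is only informative alongside the number of annotators and categories.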

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment point by point below, providing clarifications and indicating where revisions have been made to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract] The assertion that the curated PixVerve-95K dataset contains sufficiently diverse, high-quality, and correctly annotated ultra-high-resolution content is central to the paper's claims. However, no quantitative checks on annotation accuracy, inter-annotator agreement, or out-of-pipeline generalization are reported, which is critical for validating that the benchmark gains are due to the training schemes rather than data-specific artifacts.

    Authors: We agree that quantitative validation would further substantiate the dataset quality claims. In the revised manuscript, we have added a dedicated subsection in the data curation pipeline section reporting annotation accuracy on a manually verified sample of 2,000 images, inter-annotator agreement via Fleiss' kappa scores from multiple annotators on a 500-image subset, and out-of-pipeline generalization results on an external set of 1,000 UHR images. These additions confirm that performance gains stem from the training schemes rather than dataset artifacts. revision: yes

  2. Referee: [Training Schemes] Details on the three training schemes are provided, but the manuscript lacks specific information on how they handle the computational challenges of 100MP images, such as memory efficiency or resolution-specific adaptations, making it difficult to assess the stability of native generation.

    Authors: The referee correctly notes the need for more granular implementation details. We have revised the training schemes section to explicitly describe our approaches to computational challenges, including the use of DeepSpeed ZeRO-3 for distributed memory optimization, activation checkpointing to reduce peak memory, and a progressive resolution adaptation strategy that initializes at 4K before scaling to native 100MP. These details demonstrate training stability and feasibility on standard high-end hardware. revision: yes

  3. Referee: [PixVerve-Bench] The benchmark is described as using multimodal large language model-based assessments, but the specific MLLMs employed and the validation of their assessments against human judgments should be detailed to ensure the reliability of the evaluation protocol.

    Authors: We acknowledge the importance of specifying the evaluation components for reproducibility. The revised manuscript now details the exact MLLMs employed (GPT-4V and LLaVA-1.5) and includes a new validation subsection reporting results from a human study on 300 images, where MLLM scores were compared against averaged human ratings, yielding a Pearson correlation of 0.87. This supports the reliability of the MLLM-based protocol. revision: yes
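The validation described in response 3 compares MLLM scores against averaged human ratings with a Pearson correlation. A minimal sketch of that comparison follows. The scores and ratings are hypothetical toy values, the helper names are invented for illustration, and nothing here reproduces the study's actual data or its reported coefficient.

```python
import math
from typing import Sequence


def mean_ratings(ratings_per_image: Sequence[Sequence[float]]) -> list[float]:
    """Average the human ratings collected for each image."""
    return [sum(r) / len(r) for r in ratings_per_image]


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy data: one MLLM score and three human ratings per image.
mllm_scores = [4.0, 2.0, 5.0, 3.0, 1.0]
human_ratings = [[4, 5, 4], [2, 3, 2], [5, 4, 5], [3, 3, 4], [1, 2, 1]]

r = pearson_r(mllm_scores, mean_ratings(human_ratings))
print(round(r, 3))
```

Pearson correlation measures linear association only. For ordinal 1-to-5 rubric scores, a rank correlation reported alongside it would show whether the MLLM preserves the human ordering of images.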

Circularity Check

0 steps flagged

No circularity: empirical dataset and training contribution is self-contained

Full rationale

The paper introduces a new UHR dataset (PixVerve-95K) curated via a described pipeline, applies three training schemes to extend existing T2I models, and evaluates on a new benchmark (PixVerve-Bench). No equations, first-principles derivations, or fitted parameters are presented that reduce claimed performance to quantities defined by or fitted on the same inputs used for evaluation. The contribution is empirical and procedural rather than a closed mathematical chain; reported gains are attributed to experimental outcomes on held-out or constructed benchmarks, with no self-definitional loops, renamed predictions, or load-bearing self-citations that collapse the central claims.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on the assumption that high-resolution photographic and artistic content can be reliably collected, filtered, and annotated at scale without introducing systematic biases that would prevent generalization to generated images.

axioms (1)
  • Domain assumption: Existing T2I foundation models can be fine-tuned or adapted to much higher native resolutions without fundamental architectural changes.
    Invoked when the authors state they extend various T2I models to 100MP generation.
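To make the scale behind this assumption concrete, the sketch below counts the tokens a latent diffusion transformer would process at different resolutions. The 8x VAE downsampling factor and 2x2 patch size are assumptions chosen because they are common in current latent T2I models. They are not taken from the paper, and the function name is hypothetical.

```python
def latent_tokens(width: int, height: int, vae_factor: int = 8, patch: int = 2) -> int:
    """Number of transformer tokens for an image, assuming a VAE that
    downsamples each side by `vae_factor` and a patchifier that groups
    `patch` x `patch` latent cells into one token (both assumed values)."""
    stride = vae_factor * patch
    # Ceiling division so partially covered borders still count as a token.
    cols = -(-width // stride)
    rows = -(-height // stride)
    return cols * rows


base = latent_tokens(1024, 1024)
for w, h in [(1024, 1024), (4096, 4096), (10_000, 10_000)]:
    n = latent_tokens(w, h)
    # Full self-attention cost grows with the square of the token count.
    print(f"{w}x{h}: {n} tokens, {(n / base) ** 2:.0f}x attention pairs vs 1K")
```

Under these assumptions a 10,000 by 10,000 image yields 390,625 tokens against 4,096 at 1K, which is why memory handling and resolution-specific adaptation are reasonable things for a referee to ask about.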

pith-pipeline@v0.9.0 · 5788 in / 1244 out tokens · 28973 ms · 2026-05-20T05:14:39.791020+00:00 · methodology



Reference graph

Works this paper leans on

86 extracted references · 86 canonical work pages · 12 internal anchors

  1. [1]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025. 6, 7

  2. [2]

    Improving image generation with better captions.Computer Science

    James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions.Computer Science. https://cdn. openai. com/papers/dall-e-3. pdf, 2(3):8, 2023. 6

  3. [3]

    Hiflow: Training-free high-resolution image generation with flow-aligned guidance.arXiv preprint arXiv:2504.06232, 2025

    Jiazi Bu, Pengyang Ling, Yujie Zhou, Pan Zhang, Tong Wu, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Dahua Lin, and Jiaqi Wang. Hiflow: Training-free high-resolution image generation with flow-aligned guidance.arXiv preprint arXiv:2504.06232, 2025. 2, 3, 10, 11

  4. [4]

    Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

    Huanqia Cai, Sihan Cao, Ruoyi Du, Peng Gao, Steven Hoi, Zhaohui Hou, Shijie Huang, Dengyang Jiang, Xin Jin, Liangchen Li, et al. Z-image: An efficient image generation foundation model with single-stream diffusion transformer.arXiv preprint arXiv:2511.22699, 2025. 1

  5. [5]

    ArtiMuse: Fine-grained image aesthetics assessment with joint scoring and expert-level understanding

    Shuo Cao, Nan Ma, Jiayang Li, Xiaohui Li, Lihao Shao, Kaiwen Zhu, Yu Zhou, Yuandong Pu, Jiarui Wu, Jiaquan Wang, et al. Artimuse: Fine-grained image aesthetics assessment with joint scoring and expert-level understanding. arXiv preprint arXiv:2507.14533, 2025. 5, 7

  6. [6]

    Pixart-σ: Weak-to-strong training of diffusion transformer for 4k text-to-image generation

    Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-σ: Weak-to-strong training of diffusion transformer for 4k text-to-image generation. InEuropean Conference on Computer Vision, pages 74–91. Springer, 2024. 3, 4, 8

  7. [7]

    Dip: Taming diffusion models in pixel space.arXiv preprint arXiv:2511.18822, 2025

    Zhennan Chen, Junwei Zhu, Xu Chen, Jiangning Zhang, Xiaobin Hu, Hanzhen Zhao, Chengjie Wang, Jian Yang, and Ying Tai. Dip: Taming diffusion models in pixel space.arXiv preprint arXiv:2511.18822, 2025. 3, 9

  8. [8]

    L2P: Unlocking Latent Potential for Pixel Generation

    Zhennan Chen, Junwei Zhu, Xu Chen, Jiangning Zhang, Jiawei Chen, Zhuoqi Zeng, Wei Zhang, Chengjie Wang, Jian Yang, and Ying Tai. L2p: Unlocking latent potential for pixel generation.arXiv preprint arXiv:2605.12013,

  9. [9]

    Unsplash.https://unsplash.com/images, 2013

    Mikael Cho. Unsplash.https://unsplash.com/images, 2013. 4, 17

  10. [10]

    Notes on the resolution and other details of the human eye.Clarkvision

    Roger N Clark. Notes on the resolution and other details of the human eye.Clarkvision. com, 2005. 1

  11. [11]

    Demofusion: Democratising high-resolution image generation with no$

    Ruoyi Du, Dongliang Chang, Timothy Hospedales, Yi-Zhe Song, and Zhanyu Ma. Demofusion: Democratising high-resolution image generation with no$. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6159–6168, 2024. 2, 3, 10, 11, 19

  12. [12]

    I-max: Maximize the resolution potential of pre-trained rectified flow transformers with projected flow

    Ruoyi Du, Dongyang Liu, Le Zhuo, Qin Qi, Hongsheng Li, Zhanyu Ma, and Peng Gao. I-max: Maximize the resolution potential of pre-trained rectified flow transformers with projected flow. 2024. 2

  13. [13]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. InForty-first international conference on machine learning, 2024. 3 13

  14. [14]

    One-step diffusion transformer for controllable real-world image super-resolution.arXiv preprint arXiv:2511.17138,

    Yushun Fang, Yuxiang Chen, Shibo Yin, Qiang Hu, Jiangchao Yao, Ya Zhang, Xiaoyun Zhang, and Yanfeng Wang. One-step diffusion transformer for controllable real-world image super-resolution.arXiv preprint arXiv:2511.17138,

  15. [15]

    Generative adversarial nets.Advances in neural information processing systems, 27, 2014

    Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets.Advances in neural information processing systems, 27, 2014. 3

  16. [16]

    Gemini.https://gemini.google.com/, 2025

    Google. Gemini.https://gemini.google.com/, 2025. 5

  17. [17]

    Textural features for image classification

    Robert M Haralick, Karthikeyan Shanmugam, and Its’ Hak Dinstein. Textural features for image classification. IEEE Transactions on systems, man, and cybernetics, (6):610–621, 2007. 9, 19

  18. [18]

    Scalecrafter: Tuning-free higher-resolution visual generation with diffusion models

    Yingqing He, Shaoshu Yang, Haoxin Chen, Xiaodong Cun, Menghan Xia, Yong Zhang, Xintao Wang, Ran He, Qifeng Chen, and Ying Shan. Scalecrafter: Tuning-free higher-resolution visual generation with diffusion models. InThe Twelfth International Conference on Learning Representations, 2023. 3

  19. [19]

    Clipscore: A reference-free evaluation metric for image captioning

    Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. InProceedings of the 2021 conference on empirical methods in natural language processing, pages 7514–7528, 2021. 2, 10

  20. [20]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium.Advances in neural information processing systems, 30, 2017

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium.Advances in neural information processing systems, 30, 2017. 2, 9

  21. [21]

    Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020. 3

  22. [22]

    Ultragen: High-resolution video generation with hierarchical attention

    Teng Hu, Jiangning Zhang, Zihan Su, and Ran Yi. Ultragen: High-resolution video generation with hierarchical attention. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 4923–4931, 2026. 1

  23. [23]

    Fouriscale: A frequency perspective on training-free high-resolution image synthesis

    Linjiang Huang, Rongyao Fang, Aiping Zhang, Guanglu Song, Si Liu, Yu Liu, and Hongsheng Li. Fouriscale: A frequency perspective on training-free high-resolution image synthesis. InEuropean conference on computer vision, pages 196–212. Springer, 2024. 2, 3

  24. [24]

    Open-set image tagging with multi-grained text supervision

    Xinyu Huang, Yi-Jie Huang, Youcai Zhang, Weiwei Tian, Rui Feng, Yuejie Zhang, Yanchun Xie, Yaqian Li, and Lei Zhang. Open-set image tagging with multi-grained text supervision. InProceedings of the 33rd ACM International Conference on Multimedia, pages 4117–4126, 2025. 6

  25. [25]

    Pexels images.https://www.pexels.com/images/, 2014

    Ingo, Bruno Joseph, and Daniel Frese. Pexels images.https://www.pexels.com/images/, 2014. 4, 17

  26. [26]

    arXiv preprint arXiv:2510.12798 (2025)

    Qing Jiang, Junan Huo, Xingyu Chen, Yuda Xiong, Zhaoyang Zeng, Yihao Chen, Tianhe Ren, Junzhi Yu, and Lei Zhang. Detect anything via next point prediction.arXiv preprint arXiv:2510.12798, 2025. 6

  27. [27]

    Pick-a-pic: An open dataset of user preferences for text-to-image generation.Advances in neural information processing systems, 36:36652–36663, 2023

    Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-pic: An open dataset of user preferences for text-to-image generation.Advances in neural information processing systems, 36:36652–36663, 2023. 3

  28. [28]

    Flux.https://github.com/black-forest-labs/flux, 2024

    Black Forest Labs. Flux.https://github.com/black-forest-labs/flux, 2024. 3

  29. [29]

    FLUX.2: Frontier Visual Intelligence.https://bfl.ai/blog/flux-2, 2025

    Black Forest Labs. FLUX.2: Frontier Visual Intelligence.https://bfl.ai/blog/flux-2, 2025. 1, 3, 8, 10, 11

  30. [30]

    aesthetic-predictor.https://github.com/LAION-AI/aesthetic-predictor, 2022

    LAION-AI. aesthetic-predictor.https://github.com/LAION-AI/aesthetic-predictor, 2022. 5, 9

  31. [31]

    Back to Basics: Let Denoising Generative Models Denoise

    Tianhong Li and Kaiming He. Back to basics: Let denoising generative models denoise.arXiv preprint arXiv:2511.13720, 2025. 3, 9

  32. [32]

    arXiv preprint arXiv:2409.10695 , year=

    Bingchen Liu, Ehsan Akhgari, Alexander Visheratin, Aleks Kamko, Linmiao Xu, Shivam Shrirao, Chase Lambert, Joao Souza, Suhail Doshi, and Daiqing Li. Playground v3: Improving text-to-image alignment with deep-fusion large language models.arXiv preprint arXiv:2409.10695, 2024. 3

  33. [33]

    Linfusion: 1 gpu, 1 minute, 16k image.arXiv preprint arXiv:2409.02097, 2024

    Songhua Liu, Weihao Yu, Zhenxiong Tan, and Xinchao Wang. Linfusion: 1 gpu, 1 minute, 16k image.arXiv preprint arXiv:2409.02097, 2024. 10, 11

  34. [34]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023. 3

  35. [35]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis.arXiv preprint arXiv:2307.01952, 2023. 3 14

  36. [36]

    Qwen3.5: Towards Native Multimodal Agents.https://qwen.ai/blog?id=qwen3.5, February 2026

    Qwen Team. Qwen3.5: Towards Native Multimodal Agents.https://qwen.ai/blog?id=qwen3.5, February 2026. 9, 10, 20, 21

  37. [37]

    Zero-shot text-to-image generation

    Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. InInternational conference on machine learning, pages 8821–8831. Pmlr, 2021. 3

  38. [38]

    SAM 2: Segment Anything in Images and Videos

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos.arXiv preprint arXiv:2408.00714, 2024. 6

  39. [39]

    Ultrapixel: Advancing ultra high-resolution image synthesis to new peaks.Advances in Neural Information Processing Systems, 37:111131–111171, 2024

    Jingjing Ren, Wenbo Li, Haoyu Chen, Renjing Pei, Bin Shao, Yong Guo, Long Peng, Fenglong Song, and Lei Zhu. Ultrapixel: Advancing ultra high-resolution image synthesis to new peaks.Advances in Neural Information Processing Systems, 37:111131–111171, 2024. 3, 10, 11

  40. [40]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 3

  41. [41]

    LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs

    Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs.arXiv preprint arXiv:2111.02114, 2021. 3

  42. [42]

    Laion-5b: An open large-scale dataset for training next generation image-text models.Advances in neural information processing systems, 35:25278–25294,

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models.Advances in neural information processing systems, 35:25278–25294,

  43. [43]

    A mathematical theory of communication.The Bell system technical journal, 27(3): 379–423, 1948

    Claude Elwood Shannon. A mathematical theory of communication.The Bell system technical journal, 27(3): 379–423, 1948. 5, 19

  44. [44]

    Resmaster: Mastering high-resolution image generation via structural and fine-grained guidance

    Shuwei Shi, Wenbo Li, Yuechen Zhang, Jingwen He, Biao Gong, and Yinqiang Zheng. Resmaster: Mastering high-resolution image generation via structural and fine-grained guidance. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 6887–6895, 2025. 3

  45. [45]

    Freeu: Free lunch in diffusion u-net

    Chenyang Si, Ziqi Huang, Yuming Jiang, and Ziwei Liu. Freeu: Free lunch in diffusion u-net. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4733–4743, 2024. 3

  46. [46]

    Latent Wavelet Diffusion For Ultra-High-Resolution Image Synthesis

    Luigi Sigillo, Shengfeng He, and Danilo Comminiello. Latent wavelet diffusion for ultra-high-resolution image synthesis.arXiv preprint arXiv:2506.00433, 2025. 3

  47. [47]

    OpenAI GPT-5 System Card

    Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025. 5

  48. [48]

    Pixnerd: Pixel neural field diffusion.arXiv preprint arXiv:2507.23268, 2025

    Shuai Wang, Ziteng Gao, Chenhui Zhu, Weilin Huang, and Limin Wang. Pixnerd: Pixel neural field diffusion. arXiv preprint arXiv:2507.23268, 2025. 3

  49. [49]

    Image-compression-benchmark

    WangXuan95. Image-compression-benchmark. https://github.com/WangXuan95/ Image-Compression-Benchmark, 2025. 23

  50. [50]

    Qwen-Image Technical Report

    Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report.arXiv preprint arXiv:2508.02324, 2025. 3, 5, 10, 11

  51. [51]

    Fg-clip 2: A bilingual fine-grained vision-language alignment model.arXiv preprint arXiv:2510.10921, 2025

    Chunyu Xie, Bin Wang, Fanjing Kong, Jincheng Li, Dawei Liang, Ji Ao, Dawei Leng, and Yuhui Yin. Fg-clip 2: A bilingual fine-grained vision-language alignment model.arXiv preprint arXiv:2510.10921, 2025. 10

  52. [52]

    SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers

    Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, et al. Sana: Efficient high-resolution image synthesis with linear diffusion transformers.arXiv preprint arXiv:2410.10629, 2024. 1

  53. [53]

    Ultravideo: High-quality uhd video dataset with comprehensive captions.arXiv preprint arXiv:2506.13691,

    Zhucun Xue, Jiangning Zhang, Teng Hu, Haoyang He, Yinan Chen, Yuxuan Cai, Yabiao Wang, Chengjie Wang, Yong Liu, Xiangtai Li, et al. Ultravideo: High-quality uhd video dataset with comprehensive captions.arXiv preprint arXiv:2506.13691, 2025. 1

  54. [54]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025. 6 15

  55. [55]

    Ultraflux: Data-model co-design for high-quality native 4k text-to-image generation across diverse aspect ratios.arXiv preprint arXiv:2511.18050, 2025

    Tian Ye, Song Fei, and Lei Zhu. Ultraflux: Data-model co-design for high-quality native 4k text-to-image generation across diverse aspect ratios.arXiv preprint arXiv:2511.18050, 2025. 2, 3, 4, 6, 10, 11

  56. [56]

    Emov2: Pushing 5 m vision model frontier.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

    Jiangning Zhang, Teng Hu, Haoyang He, Zhucun Xue, Yabiao Wang, Chengjie Wang, Yong Liu, Xiangtai Li, and Dacheng Tao. Emov2: Pushing 5 m vision model frontier.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025. 8

  57. [57]

    One-step diffusion with distribution matching distillation

    Jiangning Zhang, Junwei Zhu, Teng Hu, Yabiao Wang, Donghao Luo, Weijian Cao, Zhenye Gan, Xiaobin Hu, Zhucun Xue, and Chengjie Wang. Transform trained transformer: Accelerating naive 4k video generation over 10×.arXiv preprint arXiv:2512.13492, 2025. 1, 9

  58. [58]

    Diffusion-4k: Ultra-high-resolution image synthesis with latent diffusion models

    Jinjin Zhang, Qiuyu Huang, Junjie Liu, Xiefan Guo, and Di Huang. Ultra-high-resolution image synthesis: Data, method and evaluation.arXiv preprint arXiv:2506.01331, 2025. 2, 3, 4, 8, 17

  59. [59]

    Diffusion-4k: Ultra-high-resolution image synthesis with latent diffusion models

    Jinjin Zhang, Qiuyu Huang, Junjie Liu, Xiefan Guo, and Di Huang. Diffusion-4k: Ultra-high-resolution image synthesis with latent diffusion models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 23464–23473, 2025. 2, 3, 4, 8, 10, 11

  60. [60]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018. 6

  61. [61]

    Ultrahr-100k: Enhancing uhr image synthesis with a large-scale high-quality dataset.arXiv preprint arXiv:2510.20661, 2025

    Chen Zhao, En Ci, Yunzhe Xu, Tiehan Fan, Shanyan Guan, Yanhao Ge, Jian Yang, and Ying Tai. Ultrahr-100k: Enhancing uhr image synthesis with a large-scale high-quality dataset.arXiv preprint arXiv:2510.20661, 2025. 2, 3, 4, 6, 8, 17

  62. [62]

    4kagent: agentic any image to 4k super-resolution

    Yushen Zuo, Qi Zheng, Mingyang Wu, Xinrui Jiang, Renjie Li, Jian Wang, Yide Zhang, Gengchen Mai, Lihong V Wang, James Zou, et al. 4kagent: agentic any image to 4k super-resolution. arXiv preprint arXiv:2507.07105,

  63. [63]

    textureless

    Appendix. The appendix presents the following sections to strengthen the main manuscript: Sec. A provides implementation details of flatness detection; Sec. B provides a further frequency-domain analysis to confirm the quality of PixVerve-95K; Sec. C provides a detailed clarification on the licensing for our proposed dataset to ensure transparency an...

  64. [67]

    self-lit

    **Rigor:** Apply strict criteria and any noticeable artifact should be reflected in the scoring. Maintain a high standard for what constitutes a “5” (Excellent). #EVALUATION RUBRICS: ##1. Structural Coherence (SC-global) Check whether the geometric structure of the entities is correct, whether there are any missing or redundant limbs, and whether the over...

  65. [71]

    Keys: “SC-global”, “PI”, “LC”, “CH” represent the scores for the 4 dimensions respectively, and “reasoning” is a concise explanation justifying the scores. Ensure the JSON property names are enclosed in double quotes and there are no trailing commas in the JSON object. #OUTPUT FORMAT: <json> {{ “SC-global”: int, “PI”: int, “LC”: int, “CH”: int, “reasoning...

  66. [72]

    **Objectivity & Fairness:** Maintain an objective stance throughout the evaluation process and base your judgement on visual evidence with the same standard instead of subjective preference

  67. [73]

    **Focus Solely on Fidelity:** Consider the image category and expected characteristics while avoiding any bias towards the content of the image. Score based on the visual quality and fidelity aspects solely

  68. [74]

    **Local-to-Global Evaluation:** Evaluate the details in Image 1, and use Image 2 to distinguish between “intended bokeh/blur” and “accidental artifacts”.

  69. [75]

    **Coordinates Reference:** Use the rectangular bounding box only to understand the local patch’s location in the overall image context, but DO NOT directly compare the local patch to the global image for pixel-level details

  70. [76]

    **Independence:** Evaluate each dimension independently without any halo effects

  71. [77]

    melting” or “waxy

    **Rigor:** Apply strict criteria and any noticeable artifact should be reflected in the scoring. Maintain a high standard for what constitutes a “5” (Excellent). #EVALUATION RUBRICS: Please evaluate the microscopic details and fidelity of the **Local Patch (Image 1)** across the 5 dimensions below, while using the Global Image (Image 2) and the relative c...

  72. [78]

    You MUST follow a strict 5-point scale and provide a score as an **INTEGER from 1 to 5 only** for each dimension

  73. [81]

    Keys: “NGE”, “GA”, “TF”, “MGC”, “SC-local” represent the scores for the 5 dimensions respectively, and “reasoning” is a concise explanation justifying the scores. Ensure the JSON property names are enclosed in double quotes and there are no trailing commas in the JSON object. #OUTPUT FORMAT: <json> {{ “NGE”: int, “GA”: int, “TF”: int, “MGC”: int, “SC-l...

  74. [82]

    **IEV (Instance Existence Verification):** Inspect whether all instances explicitly mentioned in the long caption are present. Focus strictly and solely on presence or absence rather than quality

  75. [83]

    **AAA (Appearance Attribute Alignment):** For each instance that exists, assess whether its visual attributes (color, texture, material, size, shape) align with the description in the long caption. This requires detailed cross-referencing between the caption and the visual content

  76. [84]

    **SRA (Spatial Relation Accuracy):** Evaluate whether the relative positioning (e.g., left/right, top/bottom, foreground/background) and the logical perspective between multiple instances are accurately depicted in the image. #CRITICAL SCORING RULES (Must Strictly Follow):

  77. [85]

    **Hierarchical Dependence:** **IEV** is the gatekeeper. If any critical instance is missing (IEV below 4), the corresponding AAA and SRA for the image must be penalized accordingly, as attributes and relations cannot exist without the entity

  78. [86]

    **Detail Awareness:** Since this is a high-resolution image evaluation task, you must meticulously scan **the entire canvas**, including corners and background, to identify all mentioned instances and their micro-details

  79. [87]

    making choices

    **Strict Adherence to Explicit Constraints:** Judge the image ONLY based on what is explicitly stated in the long caption. Do not impose imaginary constraints or personal aesthetic preferences. For any visual aspects NOT mentioned (e.g., specific lighting, background nuances, or artistic style), the generation model is allowed creative autonomy. Do not pe...

  80. [88]

    **Hallucination Penalty:** If the synthesized image contains prominent instances that are NOT mentioned in the long caption and significantly distract from the caption’s content (severe hallucination), deduct 1-2 points from **IEV**

Showing first 80 references.
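
The judge prompts quoted in the entries above ask for a reply wrapped in a `<json>` block, with one integer score from 1 to 5 per dimension plus a `reasoning` string. A minimal sketch of how such a reply might be checked is given below. The function name `parse_judge_reply`, the closing `</json>` tag, and the two key tuples are illustrative assumptions (the quoted prompts are truncated before the end of the output format), not the authors' code.

```python
import json
import re
from typing import Dict, Sequence

# Score keys named in the quoted prompts; "reasoning" is the free-text field.
GLOBAL_KEYS = ("SC-global", "PI", "LC", "CH")
LOCAL_KEYS = ("NGE", "GA", "TF", "MGC", "SC-local")


def parse_judge_reply(reply: str, score_keys: Sequence[str]) -> Dict[str, object]:
    """Extract and validate the JSON object from a judge reply.

    Assumes the object is wrapped in <json>...</json>; the closing tag is
    an assumption, since the quoted prompt is cut off before it.
    """
    match = re.search(r"<json>\s*(\{.*?\})\s*</json>", reply, flags=re.DOTALL)
    if match is None:
        raise ValueError("no <json>...</json> block found")
    data = json.loads(match.group(1))

    for key in score_keys:
        if key not in data:
            raise ValueError(f"missing score key: {key}")
        value = data[key]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"{key} must be an integer, got {value!r}")
        if not 1 <= value <= 5:
            raise ValueError(f"{key} must be in 1..5, got {value}")

    if not isinstance(data.get("reasoning"), str):
        raise ValueError("missing or non-string 'reasoning' field")
    return data
```

Calling `parse_judge_reply(reply, LOCAL_KEYS)` on a local-patch reply returns the parsed dictionary or raises `ValueError`, so out-of-range or malformed scores can be dropped or re-queried before averaging.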