arxiv: 2605.23518 · v1 · pith:QQHIFXCCnew · submitted 2026-05-22 · 💻 cs.CV

VINS-120K: Ultra High-Resolution Image Editing with A Large-Scale Dataset

Pith reviewed 2026-05-25 04:19 UTC · model grok-4.3

classification 💻 cs.CV
keywords image editing · vins-120k · dataset · instruction · large-scale · texture · aesthetic

The pith

VINS-120K supplies 120K curated triplets for instruction-driven editing of images at resolutions of at least 4096 by 4096 pixels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds what it describes as the first large-scale dataset for this task: 120,000 (instruction, input image, edited image) triplets in which every image is at least 4096 by 4096 pixels. The collection passes through a multi-stage filtering process that enforces visual quality, instruction match, and aesthetic standards. The authors also introduce a post-adaptation technique that lets existing lower-resolution editing models handle high-frequency textures at ultra-high resolution, and a new benchmark, VINS-4KEval, that measures performance across editing types in this regime. If the claims hold, the work removes a primary data barrier to realistic instruction-based editing at professional resolutions.

Core claim

The central claim is that a rigorously filtered collection of 120K instruction-aligned triplets at ≥4K resolution (at least 4096 by 4096 pixels), combined with a high-frequency-aware post-adaptation method, enables pretrained models to synthesize finer details and more realistic textures when editing ultra-high-resolution images.

What carries the argument

Two components carry the argument: the VINS-120K dataset of (instruction, input image, edited image) triplets at ultra-high resolution, and the high-frequency-aware post-adaptation strategy that extends lower-resolution models to that regime.

If this is right

  • Adapted models produce higher-fidelity detail synthesis and texture realism on UHR edits than the same models without the adaptation step.
  • VINS-4KEval supplies a standardized way to compare different editing approaches across many instruction types at consistent high resolution.
  • The post-adaptation approach allows reuse of existing non-UHR pretrained weights rather than training new models from scratch at full resolution.
  • Instruction-based editing becomes feasible for applications that require output resolutions of 4096 by 4096 or greater.
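
The Figure 9 caption on this page notes that omitting RoPE rescaling hurts adaptation to positional encodings beyond the pretraining range. The paper's exact rescaling rule is not reproduced here, so the following is only a minimal sketch of one common variant, linear position interpolation, in which positions on the larger token grid are compressed back into the range seen during pretraining. The function names and the `train_len` / `target_len` parameters are illustrative assumptions, not the authors' API.

```python
import math


def rope_angles(position: float, dim: int, base: float = 10000.0) -> list[float]:
    """Rotation angles for one position under standard RoPE (dim must be even)."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    return [position * base ** (-2.0 * i / dim) for i in range(dim // 2)]


def rescaled_position(position: int, train_len: int, target_len: int) -> float:
    """Linear position interpolation (an assumed variant, not the paper's rule):
    squeeze positions of a longer sequence into [0, train_len)."""
    if target_len <= train_len:
        return float(position)  # already inside the pretraining range
    return position * train_len / target_len


def apply_rope(vec: list[float], angles: list[float]) -> list[float]:
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... by the given angles."""
    out: list[float] = []
    for i, theta in enumerate(angles):
        x, y = vec[2 * i], vec[2 * i + 1]
        out += [
            x * math.cos(theta) - y * math.sin(theta),
            x * math.sin(theta) + y * math.cos(theta),
        ]
    return out
```

With hypothetical lengths `train_len=64` and `target_len=256`, the last position 255 maps to 63.75, so every rotation angle stays within the range the pretrained weights have already seen. The trade-off is that neighbouring tokens end up closer together in position space, which is one reason such schemes are usually paired with fine-tuning.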

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The dataset construction pipeline could be reused to generate similar high-quality pairs for related tasks such as high-resolution image generation or restoration.
  • Future models might combine the adaptation strategy with progressive training schedules to reach even higher resolutions without proportional compute growth.
  • The benchmark could reveal whether current metrics adequately capture perceptual quality differences at ultra-high resolutions.
  • Extending the same curation logic to video sequences might address temporal consistency in high-resolution editing.

Load-bearing premise

The multi-stage curation pipeline produces triplets that are verifiably high-quality, instruction-aligned, and free of systematic artifacts or biases.
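
As a rough illustration of what a sequential curation pipeline of this kind looks like, the sketch below chains the four stages named in the Figure 4 caption (corruption, low quality, inconsistent instructions, poor aesthetics) together with the 4096 by 4096 resolution floor, and records how many triplets survive each stage. The `Triplet` fields, score names, and thresholds are hypothetical placeholders; the paper's actual filters and cut-offs are not given on this page.

```python
from dataclasses import dataclass
from typing import Callable

MIN_SIDE = 4096  # resolution floor stated in the abstract


@dataclass
class Triplet:
    instruction: str
    width: int
    height: int
    decodable: bool    # hypothetical: both images load without error
    sharpness: float   # hypothetical quality score in [0, 1]
    alignment: float   # hypothetical instruction-match score in [0, 1]
    aesthetic: float   # hypothetical aesthetic score in [0, 1]


Stage = tuple[str, Callable[[Triplet], bool]]

# Stage order follows the Figure 4 caption; thresholds are assumptions only.
STAGES: list[Stage] = [
    ("corruption", lambda t: t.decodable),
    ("resolution", lambda t: min(t.width, t.height) >= MIN_SIDE),
    ("quality", lambda t: t.sharpness >= 0.5),
    ("instruction", lambda t: t.alignment >= 0.7),
    ("aesthetics", lambda t: t.aesthetic >= 0.6),
]


def run_pipeline(items: list[Triplet]) -> tuple[list[Triplet], dict[str, int]]:
    """Apply stages in order; return survivors and the survivor count per stage."""
    kept_per_stage: dict[str, int] = {}
    for name, keep in STAGES:
        items = [t for t in items if keep(t)]
        kept_per_stage[name] = len(items)
    return items, kept_per_stage
```

Reporting the per-stage survivor counts alongside the final set is one inexpensive way to expose where a pipeline discards most of its data, which bears directly on whether the premise above can be checked.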

What would settle it

An experiment in which models adapted with VINS-120K show no measurable gain in fine detail or texture metrics over strong baselines when tested on a fresh set of real-world 4K+ images with human-written instructions.
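
One way to make "no measurable gain" concrete is a paired comparison: score each test image with the adapted model and with a baseline, then check whether a confidence interval for the mean per-image difference excludes zero. The sketch below uses a percentile bootstrap. The choice of metric, the function name, and the default parameters are assumptions for illustration and do not come from the paper.

```python
import random
import statistics


def paired_bootstrap_ci(
    adapted: list[float],
    baseline: list[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float, float]:
    """Return (mean gain, CI lower bound, CI upper bound) for per-image gains."""
    if len(adapted) != len(baseline) or not adapted:
        raise ValueError("need two non-empty score lists of equal length")
    # Per-image differences keep the pairing between the two models.
    diffs = [a - b for a, b in zip(adapted, baseline)]
    rng = random.Random(seed)
    # Resample the differences with replacement and collect the resampled means.
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=len(diffs)))
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(diffs), lower, upper
```

Under this criterion, an interval that contains zero on a fresh set of real-world 4K+ images with human-written instructions would count as no measurable gain for the chosen metric.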

Figures

Figures reproduced from arXiv: 2605.23518 by En Ci, Jian Yang, Shanyan Guan, Wei Li, Yanhao Ge, Ying Tai, Zhanxin Gao, Zhenyu Zhang, Zhizhou Chen.

Figure 1. Comparison at ultra-high-resolution editing: From left to right are the input image, our edited result, and the edited image
Figure 2. An overview and visualized examples of edited triplets (instruction, input image, edited image) for each edit type in VINS-120K.
Figure 3. Filtered examples of video frames. Purple indicates similar frames (high CLIP score), while blue shows frames with semantic misalignment (high optical flow). … video, we first segment it into multiple clips with consistent content using PySceneDetect [5]. Next, we extract frames from each clip and combine them into candidate image pairs. Finally, we compute semantic similarity using CLIP Score [34] and mot…
Figure 4. Data Filtering Pipeline. We filter images sequentially for corruption, low quality, inconsistent instructions, and poor aesthetics, retaining only 20% of the highest-quality data. Image Quality Filtering: In this stage, we filter high-resolution images exhibiting low visual quality through multi-dimensional filters: • Structural Clarity: We compute the Tenengrad gradient magnitude [22] to measure edge shar…
Figure 5. Comparison of image statistics between AnyEdit [
Figure 6. Qualitative comparisons on the VINS-4KEval benchmark.
Figure 7. Qualitative comparison with Seedream 4.0.
Figure 8. Ablation on attention-score rescaling. Blue: with rescal
Figure 9. …, omitting RoPE rescaling hinders the ability to adapt to positional encodings beyond the pretraining range, resulting in semantic drift or severe local repetition. These indicate that naive UHR scaling is insufficient and that dedicated post-adaptation is necessary for high-quality ultra-high-resolution image editing. More detailed ablation studies are provided in the supplementary material. Effect of …
Figure 10. Loss curves of models trained without rescaling at
Figure 11. Distribution of editing types in VINS-120K across dif
Figure 12. Qualitative comparison on multi-turn editing evaluation.
Figure 13. Qualitative comparisons on out-of-domain editing eval
Figure 14. Qualitative results on different base model: QwenImage-Edit-2511.
Figure 15. Editing failure examples. G. Limitations and Future Work: Our method still has certain limitations in text editing [12]. As shown in
Figure 16. Spectral density analysis of the generated images.
Figure 17. Comparison of image statistics between AnyEdit [
Figure 18. Examples of Editing Instruction Annotation performed by our pipeline.
Figure 19. More high-quality examples from VINS-120K Dataset.
Figure 20. More qualitative comparison (1/2) between our method and recent baselines (Seedream4.0 [
Figure 21. More qualitative comparison (2/2) between our method and recent baselines (Seedream4.0 [
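
The Figure 4 caption mentions the Tenengrad gradient magnitude as a structural-clarity filter. Tenengrad is commonly computed as the sum (here, the mean) of squared Sobel gradient magnitudes over the image, so sharper edges give larger values. The sketch below is a plain-Python version for a small grayscale image given as nested lists. How the paper normalizes or thresholds the score is not stated on this page, and in practice an array library would replace the explicit loops.

```python
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def tenengrad(gray: list[list[float]]) -> float:
    """Mean squared Sobel gradient magnitude over interior pixels."""
    h, w = len(gray), len(gray[0])
    if h < 3 or w < 3:
        raise ValueError("image must be at least 3x3")
    total = 0.0
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = gy = 0.0
            # Correlate the 3x3 neighbourhood with both Sobel kernels.
            for dy in range(3):
                for dx in range(3):
                    pixel = gray[y + dy - 1][x + dx - 1]
                    gx += SOBEL_X[dy][dx] * pixel
                    gy += SOBEL_Y[dy][dx] * pixel
            total += gx * gx + gy * gy
    return total / ((h - 2) * (w - 2))
```

A flat image scores zero and a hard vertical edge scores higher than a softened one, which is the property a sharpness filter relies on when it discards images below some threshold.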
Original abstract

Directly editing ultra-high-resolution (UHR) images is valuable but underexplored, primarily due to the lack of high-quality data and the challenge in modeling high-frequency texture details. We introduce VINS-120K, the first large-scale dataset for instruction-based UHR image editing, comprising 120K carefully curated triplets of instruction, input image, and edited image. Each image exceeds 4K resolution ($\geq 4096 \times 4096$) and is filtered through a rigorous multi-stage pipeline to ensure visual quality, instruction alignment, and aesthetic fidelity. Built on VINS-120K, we further develop a high-frequency-aware post-adaptation strategy to extend pretrained non-high-resolution models to the UHR regime. We also present VINS-4KEval, a benchmark covering diverse editing types, to facilitate consistent evaluation in UHR settings. Experiments confirm that our work improves fine-grained detail synthesis and texture realism in UHR image editing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces VINS-120K, the first large-scale dataset of 120K instruction-based triplets (instruction, input image, edited image) for ultra-high-resolution (UHR) image editing where every image exceeds 4K resolution. It describes a multi-stage curation pipeline to ensure visual quality, instruction alignment, and aesthetic fidelity; proposes a high-frequency-aware post-adaptation strategy to extend pretrained models to the UHR regime; and presents the VINS-4KEval benchmark. Experiments are claimed to confirm gains in fine-grained detail synthesis and texture realism.

Significance. If the curation pipeline produces verifiably high-quality, instruction-aligned triplets and the reported gains are reproducible against baselines, the dataset and adaptation method could meaningfully advance UHR editing research by addressing the current scarcity of suitable training data and evaluation protocols.

major comments (2)
  1. [Abstract] The central claim that the dataset and post-adaptation strategy improve fine-grained detail synthesis and texture realism is asserted without any reported metrics, baselines, ablation studies, or evaluation protocol, rendering the claim impossible to assess.
  2. [Dataset construction paragraph] The multi-stage curation pipeline is presented as producing high-quality, instruction-aligned triplets free of systematic artifacts, yet no quantitative validation (human preference scores, alignment accuracy, or artifact statistics) is supplied; this validation is load-bearing for isolating claimed texture-realism gains from data artifacts.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address each major comment below. Where the comments identify gaps in the current manuscript, we agree that revisions are warranted and will incorporate the suggested additions in the revised version.

Point-by-point responses
  1. Referee: [Abstract] The central claim that the dataset and post-adaptation strategy improve fine-grained detail synthesis and texture realism is asserted without any reported metrics, baselines, ablation studies, or evaluation protocol, rendering the claim impossible to assess.

    Authors: We agree that the abstract would benefit from explicit quantitative support. The full manuscript contains an Experiments section that reports results on VINS-4KEval, including comparisons against baselines and ablations demonstrating gains in detail synthesis and texture realism. To address the concern directly, we will revise the abstract to include a concise summary of the key metrics (e.g., improvements in perceptual quality and alignment scores) and reference the evaluation protocol.
    Revision: yes

  2. Referee: [Dataset construction paragraph] The multi-stage curation pipeline is presented as producing high-quality, instruction-aligned triplets free of systematic artifacts, yet no quantitative validation (human preference scores, alignment accuracy, or artifact statistics) is supplied; this validation is load-bearing for isolating claimed texture-realism gains from data artifacts.

    Authors: The current manuscript describes the pipeline stages but does not include quantitative validation of the curation outcomes. We concur that such validation is important to substantiate the quality claims and to separate data effects from the adaptation method. We will add a dedicated paragraph or table in the revised manuscript reporting human preference studies (e.g., alignment accuracy and artifact rates) and any available automated statistics from the filtering stages.
    Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical dataset contribution with no derivations or self-referential predictions

full rationale

The paper presents VINS-120K as a curated dataset of 120K triplets and a high-frequency-aware post-adaptation strategy, supported by experiments on VINS-4KEval. No equations, fitted parameters, predictions, or uniqueness theorems appear in the provided text. The multi-stage curation pipeline is described as an input process rather than a derived result, and downstream claims rest on empirical outcomes rather than any reduction to self-defined quantities or self-citations. This matches the default case of a self-contained empirical contribution with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Contribution rests on the empirical claim that a filtered 120K triplet collection plus a high-frequency adaptation step yields measurable gains; no mathematical free parameters, new physical entities, or formal axioms are introduced.

axioms (1)
  • domain assumption A multi-stage filtering pipeline can reliably produce instruction-aligned, high-aesthetic UHR editing triplets without introducing curation artifacts.
    Invoked in the dataset construction paragraph of the abstract.

pith-pipeline@v0.9.0 · 5722 in / 1212 out tokens · 26043 ms · 2026-05-25T04:19:23.987167+00:00 · methodology

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · 14 internal anchors

  1. [1]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical repor...

  2. [2]

    Instructpix2pix: Learning to follow image editing instructions

    Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18392–18402, 2023.

  3. [3]

    Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing

    Mingdeng Cao, Xintao Wang, Zhongang Qi, Ying Shan, Xiaohu Qie, and Yinqiang Zheng. Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing. In Proceedings of the IEEE/CVF international conference on computer vision, pages 22560–22570, 2023.

  4. [4]

    Artimuse: Fine-grained image aesthetics assessment with joint scoring and expert-level understanding

    Shuo Cao, Nan Ma, Jiayang Li, Xiaohui Li, Lihao Shao, Kaiwen Zhu, Yu Zhou, Yuandong Pu, Jiarui Wu, Jiaquan Wang, et al. Artimuse: Fine-grained image aesthetics assessment with joint scoring and expert-level understanding. arXiv preprint arXiv:2507.14533, 2025.

  5. [5]

    Pyscenedetect

    Brandon Castellano. Pyscenedetect. https://www.scenedetect.com. Video Cut Detection and Analysis Tool. Available at: https://github.com/Breakthrough/PySceneDetect. BSD-3-Clause License.

  6. [6]

    Faithdiff: Unleashing diffusion priors for faithful image super-resolution

    Junyang Chen, Jinshan Pan, and Jiangxin Dong. Faithdiff: Unleashing diffusion priors for faithful image super-resolution. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 28188–28197, 2025.

  7. [7]

    Ragd: Regional-aware diffusion model for text-to-image generation

    Zhennan Chen, Yajie Li, Haofan Wang, Zhibo Chen, Zhengkai Jiang, Jun Li, Qian Wang, Jian Yang, and Ying Tai. Ragd: Regional-aware diffusion model for text-to-image generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19331–19341, 2025.

  8. [8]

    Dip: Taming diffusion models in pixel space

    Zhennan Chen, Junwei Zhu, Xu Chen, Jiangning Zhang, Xiaobin Hu, Hanzhen Zhao, Chengjie Wang, Jian Yang, and Ying Tai. Dip: Taming diffusion models in pixel space. arXiv preprint arXiv:2511.18822, 2025.

  9. [9]

    Describe, don’t dictate: Semantic image editing with natural language intent

    En Ci, Shanyan Guan, Yanhao Ge, Yilin Zhang, Wei Li, Zhenyu Zhang, Jian Yang, and Ying Tai. Describe, don’t dictate: Semantic image editing with natural language intent. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19185–19194, 2025.

  10. [10]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.

  11. [11]

    Emerging Properties in Unified Multimodal Pretraining

    Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, and Haoqi Fan. Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683, 2025.

  12. [12]

    Textcrafter: Accurately rendering multiple texts in complex visual scenes

    Nikai Du, Zhennan Chen, Shan Gao, Zhizhou Chen, Xi Chen, Zhengkai Jiang, Jian Yang, and Ying Tai. Textcrafter: Accurately rendering multiple texts in complex visual scenes. arXiv preprint arXiv:2503.23461, 2025.

  13. [13]

    I-max: Maximize the resolution potential of pre-trained rectified flow transformers with projected flow

    Ruoyi Du, Dongyang Liu, Le Zhuo, Qin Qi, Hongsheng Li, Zhanyu Ma, and Peng Gao. I-max: Maximize the resolution potential of pre-trained rectified flow transformers with projected flow. 2024.

  14. [14]

    Guiding instruction-based image editing via multimodal large language models

    Tsu-Jui Fu, Wenze Hu, Xianzhi Du, William Yang Wang, Yinfei Yang, and Zhe Gan. Guiding instruction-based image editing via multimodal large language models. arXiv preprint arXiv:2309.17102, 2023.

  15. [15]

    Seed-data-edit technical report: A hybrid dataset for instructional image editing

    Yuying Ge, Sijie Zhao, Chen Li, Yixiao Ge, and Ying Shan. Seed-data-edit technical report: A hybrid dataset for instructional image editing. arXiv preprint arXiv:2405.04007,

  16. [16]

    Textural features for image classification

    Robert M Haralick, Karthikeyan Shanmugam, and Its'hak Dinstein. Textural features for image classification. IEEE Transactions on systems, man, and cybernetics, (6):610–621,

  17. [17]

    Hiflow: Generating diverse HI maps and inferring cosmology while marginalizing over astrophysics using normalizing flows

    Sultan Hassan, Francisco Villaescusa-Navarro, Benjamin Wandelt, David N Spergel, Daniel Anglés-Alcázar, Shy Genel, Miles Cranmer, Greg L Bryan, Romeel Davé, Rachel S Somerville, et al. Hiflow: Generating diverse HI maps and inferring cosmology while marginalizing over astrophysics using normalizing flows. The Astrophysical Journal, 937(2):83, 2022.

  18. [18]

    Prompt-to-Prompt Image Editing with Cross Attention Control

    Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.

  19. [19]

    Lora: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022.

  20. [20]

    Smartedit: Exploring complex instruction-based image editing with multimodal large language models

    Yuzhou Huang, Liangbin Xie, Xintao Wang, Ziyang Yuan, Xiaodong Cun, Yixiao Ge, Jiantao Zhou, Chao Dong, Rui Huang, Ruimao Zhang, et al. Smartedit: Exploring complex instruction-based image editing with multimodal large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8362–8371, 2024.

  21. [21]

    Hq-edit: A high-quality dataset for instruction-based image editing

    Mude Hui, Siwei Yang, Bingchen Zhao, Yichun Shi, Heng Wang, Peng Wang, Yuyin Zhou, and Cihang Xie. Hq-edit: A high-quality dataset for instruction-based image editing. arXiv preprint arXiv:2404.09990, 2024.

  22. [22]

    Focusing

    Eric Krotkov. Focusing. International Journal of Computer Vision, 1(3):223–237, 1988.

  23. [23]

    Viescore: Towards explainable metrics for conditional image synthesis evaluation

    Max Ku, Dongfu Jiang, Cong Wei, Xiang Yue, and Wenhu Chen. Viescore: Towards explainable metrics for conditional image synthesis evaluation. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12268–12290, 2024.

  24. [24]

    Flux

    Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2024.

  25. [25]

    Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, et al. Flux.1 kontext: Flow matching for in-context image generation and editing in latent space. arXiv preprint arXiv:2506.15742,

  27. [27]

    Balancing preservation and modification: A region and semantic aware metric for instruction-based image editing

    Zhuoying Li, Zhu Xu, Yuxin Peng, and Yang Liu. Balancing preservation and modification: A region and semantic aware metric for instruction-based image editing. arXiv preprint arXiv:2506.13827, 2025.

  28. [28]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.

  29. [29]

    Step1X-Edit: A Practical Framework for General Image Editing

    Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, et al. Step1x-edit: A practical framework for general image editing. arXiv preprint arXiv:2504.17761, 2025.

  30. [30]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

  31. [31]

    X2edit: Revisiting arbitrary-instruction image editing through self-constructed data and task-aware representation learning

    Jian Ma, Xujie Zhu, Zihao Pan, Qirong Peng, Xu Guo, Chen Chen, and Haonan Lu. X2edit: Revisiting arbitrary-instruction image editing through self-constructed data and task-aware representation learning. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 7764–7772, 2026.

  32. [32]

    OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation

    Kepan Nan, Rui Xie, Penghao Zhou, Tiehan Fan, Zhenheng Yang, Zhijie Chen, Xiang Li, Jian Yang, and Ying Tai. Openvid-1m: A large-scale high-quality dataset for text-to-video generation. arXiv preprint arXiv:2407.02371, 2024.

  33. [33]

    YaRN: Efficient Context Window Extension of Large Language Models

    Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. Yarn: Efficient context window extension of large language models. arXiv preprint arXiv:2309.00071, 2023.

  34. [34]

    Cinescale: Free lunch in high-resolution cinematic visual generation

    Haonan Qiu, Ning Yu, Ziqi Huang, Paul Debevec, and Ziwei Liu. Cinescale: Free lunch in high-resolution cinematic visual generation. arXiv preprint arXiv:2508.15774, 2025.

  35. [35]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.

  36. [36]

    Ultrapixel: Advancing ultra high-resolution image synthesis to new peaks

    Jingjing Ren, Wenbo Li, Haoyu Chen, Renjing Pei, Bin Shao, Yong Guo, Long Peng, Fenglong Song, and Lei Zhu. Ultrapixel: Advancing ultra high-resolution image synthesis to new peaks. Advances in Neural Information Processing Systems, 37:111131–111171, 2024.

  37. [37]

    Self-reflection in llm agents: Effects on problem-solving performance

    Matthew Renze and Erhan Guven. Self-reflection in llm agents: Effects on problem-solving performance. arXiv preprint arXiv:2405.06682, 2024.

  38. [38]

    Laion-5b: An open large-scale dataset for training next generation image-text models

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in neural information processing systems, 35:25278–25294, 2022.

  39. [39]

    Seedream 4.0: Toward Next-generation Multimodal Image Generation

    Team Seedream, Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, Zhiyao Guo, Xiaoxia Hou, Weilin Huang, Yixuan Huang, et al. Seedream 4.0: Toward next-generation multimodal image generation. arXiv preprint arXiv:2509.20427, 2025.

  40. [40]

    Emu edit: Precise image editing via recognition and generation tasks

    Shelly Sheynin, Adam Polyak, Uriel Singer, Yuval Kirstain, Amit Zohar, Oron Ashual, Devi Parikh, and Yaniv Taigman. Emu edit: Precise image editing via recognition and generation tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8871–8879, 2024.

  41. [41]

    Roformer: Enhanced transformer with rotary position embedding

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063,

  42. [42]

    Raft: Recurrent all-pairs field transforms for optical flow

    Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. In European conference on computer vision, pages 402–419. Springer, 2020.

  43. [43]

    Plug-and-play diffusion features for text-driven image-to-image translation

    Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. Plug-and-play diffusion features for text-driven image-to-image translation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1921–1930, 2023.

  44. [44]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.

  45. [45]

    Real-esrgan: Training real-world blind super-resolution with pure synthetic data

    Xintao Wang, Liangbin Xie, Chao Dong, and Ying Shan. Real-esrgan: Training real-world blind super-resolution with pure synthetic data. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1905–1914,

  46. [46]

    Omniedit: Building image editing generalist models through specialist supervision

    Cong Wei, Zheyang Xiong, Weiming Ren, Xeron Du, Ge Zhang, and Wenhu Chen. Omniedit: Building image editing generalist models through specialist supervision. In The Thirteenth International Conference on Learning Representations, 2024.

  47. [47]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022.

  48. [48]

    Qwen-Image Technical Report

    Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report. arXiv preprint arXiv:2508.02324, 2025.

  49. [49]

    OmniGen2: Towards Instruction-Aligned Multimodal Generation

    Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, et al. Omnigen2: Exploration to advanced multimodal generation. arXiv preprint arXiv:2506.18871, 2025.

  50. [50]

    Ultravideo: High-quality uhd video dataset with comprehensive captions

    Zhucun Xue, Jiangning Zhang, Teng Hu, Haoyang He, Yinan Chen, Yuxuan Cai, Yabiao Wang, Chengjie Wang, Yong Liu, Xiangtai Li, et al. Ultravideo: High-quality uhd video dataset with comprehensive captions. arXiv preprint arXiv:2506.13691, 2025.

  51. [51]

    Echo-4o: Harnessing the power of gpt-4o synthetic images for improved image generation

    Junyan Ye, Dongzhi Jiang, Zihao Wang, Leqi Zhu, Zhenghao Hu, Zilong Huang, Jun He, Zhiyuan Yan, Jinghua Yu, Hongsheng Li, et al. Echo-4o: Harnessing the power of gpt-4o synthetic images for improved image generation. arXiv preprint arXiv:2508.09987, 2025.

  52. [52]

    ImgEdit: A Unified Image Editing Dataset and Benchmark

    Yang Ye, Xianyi He, Zongjian Li, Bin Lin, Shenghai Yuan, Zhiyuan Yan, Bohan Hou, and Li Yuan. Imgedit: A unified image editing dataset and benchmark. arXiv preprint arXiv:2505.20275, 2025.

  53. [53]

    Anyedit: Mastering unified high-quality image editing for any idea

    Qifan Yu, Wei Chow, Zhongqi Yue, Kaihang Pan, Yang Wu, Xiaoyang Wan, Juncheng Li, Siliang Tang, Hanwang Zhang, and Yueting Zhuang. Anyedit: Mastering unified high-quality image editing for any idea. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26125–26135, 2025.

  54. [54]

    Big bird: Transformers for longer sequences

    Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. Big bird: Transformers for longer sequences. Advances in neural information processing systems, 33:17283–17297, 2020.

  55. [55]

    Diffusion-4k: Ultra-high-resolution image synthesis with latent diffusion models

    Jinjin Zhang, Qiuyu Huang, Junjie Liu, Xiefan Guo, and Di Huang. Diffusion-4k: Ultra-high-resolution image synthesis with latent diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 23464–23473, 2025.

  56. [56]

    Designing a practical degradation model for deep blind image super-resolution

    Kai Zhang, Jingyun Liang, Luc Van Gool, and Radu Timofte. Designing a practical degradation model for deep blind image super-resolution. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4791–4800,

  57. [57]

    Enabling instructional image editing with in-context generation in large scale diffusion transformer

    Zechuan Zhang, Ji Xie, Yu Lu, Zongxin Yang, and Yi Yang. Enabling instructional image editing with in-context generation in large scale diffusion transformer. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.

  58. [58]

    Ultrahr-100k: Enhancing uhr image synthesis with a large-scale high-quality dataset

    Chen Zhao, En Ci, Yunzhe Xu, Tiehan Fan, Shanyan Guan, Yanhao Ge, Jian Yang, and Ying Tai. Ultrahr-100k: Enhancing uhr image synthesis with a large-scale high-quality dataset. arXiv preprint arXiv:2510.20661, 2025.

  59. [59]

    Ultraedit: Instruction-based fine-grained image editing at scale

    Haozhe Zhao, Xiaojian Shawn Ma, Liang Chen, Shuzheng Si, Rujie Wu, Kaikai An, Peiyu Yu, Minjia Zhang, Qing Li, and Baobao Chang. Ultraedit: Instruction-based fine-grained image editing at scale. Advances in Neural Information Processing Systems, 37:3058–3093, 2024.