VINS-120K: Ultra High-Resolution Image Editing with A Large-Scale Dataset

En Ci; Jian Yang; Shanyan Guan; Wei Li; Yanhao Ge; Ying Tai; Zhanxin Gao; Zhenyu Zhang; Zhizhou Chen

arxiv: 2605.23518 · v1 · pith:QQHIFXCCnew · submitted 2026-05-22 · 💻 cs.CV


Pith reviewed 2026-05-25 04:19 UTC · model grok-4.3

classification 💻 cs.CV

keywords: image · editing · vins-120k · dataset · instruction · large-scale · texture · aesthetic


The pith

VINS-120K supplies 120K curated triplets for instruction-driven editing of images at resolutions above 4K.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates the first large-scale dataset of 120,000 instruction-input-edited image triplets where every image is at least 4096 by 4096 pixels. It builds this collection through a multi-stage filtering process that enforces visual quality, instruction match, and aesthetic standards. The authors also introduce a post-adaptation technique that makes existing lower-resolution editing models handle high-frequency textures at ultra-high resolution. A new benchmark called VINS-4KEval is provided to measure performance across editing types in this regime. If the claims hold, the work removes a primary data barrier that has prevented realistic instruction-based editing at professional resolutions.
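The multi-stage filtering described above can be pictured as a chain of predicates applied in order, with a survivor count recorded after each stage. The sketch below is illustrative only: the `Triplet` fields, the score ranges, and every threshold are assumptions made for this example, and only the 4096-pixel floor comes from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

MIN_SIDE = 4096  # resolution floor stated in the paper (>= 4096 x 4096)


@dataclass
class Triplet:
    instruction: str
    width: int
    height: int
    sharpness: float   # hypothetical visual-quality score in [0, 1]
    alignment: float   # hypothetical instruction-alignment score in [0, 1]
    aesthetic: float   # hypothetical aesthetic score in [0, 1]


# Each stage is (name, predicate). All thresholds are illustrative assumptions.
STAGES: List[Tuple[str, Callable[[Triplet], bool]]] = [
    ("resolution", lambda t: min(t.width, t.height) >= MIN_SIDE),
    ("visual_quality", lambda t: t.sharpness >= 0.5),
    ("instruction_alignment", lambda t: t.alignment >= 0.7),
    ("aesthetics", lambda t: t.aesthetic >= 0.6),
]


def run_pipeline(items: List[Triplet]) -> Tuple[List[Triplet], Dict[str, int]]:
    """Apply the stages in order; return survivors and the count left after each stage."""
    survivors = list(items)
    counts = {"input": len(survivors)}
    for name, keep in STAGES:
        survivors = [t for t in survivors if keep(t)]
        counts[name] = len(survivors)
    return survivors, counts


candidates = [
    Triplet("make the sky stormy", 4096, 4096, 0.9, 0.9, 0.8),
    Triplet("add a red scarf", 2048, 2048, 0.9, 0.9, 0.9),       # below the resolution floor
    Triplet("remove the car", 5120, 4096, 0.3, 0.9, 0.9),        # low visual quality
    Triplet("turn day into night", 4096, 6000, 0.8, 0.4, 0.9),   # weak instruction match
    Triplet("repaint the wall", 8192, 4096, 0.8, 0.9, 0.2),      # low aesthetic score
]
kept, counts = run_pipeline(candidates)
```

Recording the count after each stage makes the retention rate of every filter visible, which is the kind of per-stage statistic a reader would need to judge how aggressive the curation is.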

Core claim

The central claim is that a rigorously filtered collection of 120K instruction-aligned triplets at >=4K resolution, combined with a high-frequency-aware post-adaptation method, enables pretrained models to synthesize finer details and more realistic textures when editing ultra-high-resolution images.

What carries the argument

The VINS-120K dataset of instruction-image-edited triplets at ultra-high resolution together with the high-frequency-aware post-adaptation strategy that extends lower-resolution models.

If this is right

Adapted models produce higher-fidelity detail synthesis and texture realism on UHR edits than the same models without the adaptation step.
VINS-4KEval supplies a standardized way to compare different editing approaches across many instruction types at consistent high resolution.
The post-adaptation approach allows reuse of existing non-UHR pretrained weights rather than training new models from scratch at full resolution.
Instruction-based editing becomes feasible for applications that require output resolutions of 4096 by 4096 or greater.
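Reusing lower-resolution weights means the model meets token positions it never saw in pretraining; the Figure 9 text notes that omitting RoPE rescaling leads to semantic drift or local repetition. The sketch below shows plain linear position interpolation for rotary embeddings, a generic technique and not necessarily the authors' scheme. The sequence lengths, the embedding dimension, and the helper names are hypothetical.

```python
import math


def rope_angles(position, dim, base=10000.0, scale=1.0):
    """Rotation angles for one position; scale < 1 compresses positions."""
    half = dim // 2
    return [(position * scale) * base ** (-2.0 * i / dim) for i in range(half)]


def interpolation_scale(train_len, target_len):
    """Linear position-interpolation factor mapping [0, target_len) into [0, train_len)."""
    return min(1.0, train_len / target_len)


TRAIN_LEN = 128    # hypothetical tokens per side seen during pretraining
TARGET_LEN = 512   # hypothetical tokens per side at ultra-high resolution

scale = interpolation_scale(TRAIN_LEN, TARGET_LEN)
# Angles at the last target position, without and with rescaling.
last_target_naive = rope_angles(TARGET_LEN - 1, dim=64)
last_target_scaled = rope_angles(TARGET_LEN - 1, dim=64, scale=scale)
```

Without rescaling, the lowest-index frequency rotates by an angle far outside anything seen in pretraining; with rescaling, every angle stays inside the pretrained range, at the cost of packing positions more densely.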

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The dataset construction pipeline could be reused to generate similar high-quality pairs for related tasks such as high-resolution image generation or restoration.
Future models might combine the adaptation strategy with progressive training schedules to reach even higher resolutions without proportional compute growth.
The benchmark could reveal whether current metrics adequately capture perceptual quality differences at ultra-high resolutions.
Extending the same curation logic to video sequences might address temporal consistency in high-resolution editing.

Load-bearing premise

The multi-stage curation pipeline produces triplets that are verifiably high-quality, instruction-aligned, and free of systematic artifacts or biases.
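One concrete place where this premise can be probed is the video-frame pairing step. According to the Figure 3 caption, frame pairs are rejected when they are too similar (high CLIP score) or semantically misaligned (high optical flow). The following is a minimal sketch of such a two-sided filter, assuming embeddings and mean flow magnitudes have already been computed elsewhere; both thresholds and the toy vectors are invented for illustration.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


SIM_MAX = 0.95    # hypothetical: above this the frames are near-duplicates
FLOW_MAX = 20.0   # hypothetical: mean flow magnitude (pixels) above this means misalignment


def select_pairs(frames, flow):
    """frames: {frame_id: embedding}; flow: {(id_a, id_b): mean flow magnitude}."""
    kept = []
    for a, b in combinations(sorted(frames), 2):
        if cosine(frames[a], frames[b]) > SIM_MAX:
            continue  # too similar: no meaningful edit separates the two frames
        if flow[(a, b)] > FLOW_MAX:
            continue  # too much motion: the content no longer corresponds
        kept.append((a, b))
    return kept


frames = {
    "f0": [1.0, 0.0, 0.0],
    "f1": [0.99, 0.05, 0.0],   # nearly identical to f0
    "f2": [0.7, 0.7, 0.1],
}
flow = {("f0", "f1"): 1.0, ("f0", "f2"): 8.0, ("f1", "f2"): 35.0}
pairs = select_pairs(frames, flow)
```

Any systematic bias in the retained pairs would enter through the choice of these two thresholds, which is why their values and the resulting rejection rates matter for the premise.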

What would settle it

An experiment in which models adapted with VINS-120K show no measurable gain in fine detail or texture metrics over strong baselines when tested on a fresh set of real-world 4K+ images with human-written instructions.
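"No measurable gain" needs an operational meaning. One option is a paired bootstrap over per-image metric differences: if the confidence interval for the mean gain includes zero, the gain is not distinguishable from noise. The sketch below assumes a single scalar detail or texture score per image; the scores are made up and the function name is hypothetical.

```python
import random


def paired_bootstrap_ci(baseline, adapted, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile confidence interval for the mean per-image gain (adapted - baseline)."""
    diffs = [a - b for a, b in zip(adapted, baseline)]
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lo, hi


# Hypothetical per-image texture scores on the same held-out images.
baseline_scores = [0.61, 0.58, 0.70, 0.66, 0.59, 0.63]
adapted_scores = [0.66, 0.60, 0.74, 0.69, 0.65, 0.64]
mean_gain, ci_low, ci_high = paired_bootstrap_ci(baseline_scores, adapted_scores)
gain_is_measurable = ci_low > 0
```

Pairing by image removes between-image variance, which matters when the test set is small and scene difficulty varies widely.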

Figures

Figures reproduced from arXiv: 2605.23518 by En Ci, Jian Yang, Shanyan Guan, Wei Li, Yanhao Ge, Ying Tai, Zhanxin Gao, Zhenyu Zhang, Zhizhou Chen.

**Figure 1.** Comparison at ultra-high-resolution editing: From left to right are the input image, our edited result, and the edited image

**Figure 2.** An overview and visualized examples of edited triplets (instruction, input image, edited image) for each edit type in VINS-120K.

**Figure 3.** Filtered examples of video frames. Purple indicates similar frames (high CLIP score), while blue shows frames with semantic misalignment (high optical flow). Adjacent body text: "video, we first segment it into multiple clips with consistent content using PySceneDetect [5]. Next, we extract frames from each clip and combine them into candidate image pairs. Finally, we compute semantic similarity using CLIP Score [34] and mot…"

**Figure 4.** Data Filtering Pipeline. We filter images sequentially for corruption, low quality, inconsistent instructions, and poor aesthetics, retaining only 20% of the highest-quality data. Adjacent body text: "Image Quality Filtering. In this stage, we filter high-resolution images exhibiting low visual quality through multi-dimensional filters: • Structural Clarity: We compute the Tenengrad gradient magnitude [22] to measure edge shar…"

**Figure 5.** Comparison of image statistics between AnyEdit [… 

**Figure 6.** Qualitative comparisons on the VINS-4KEval benchmark.

**Figure 7.** Qualitative comparison with Seedream 4.0.

**Figure 8.** Ablation on attention-score rescaling. Blue: with rescal…

**Figure 9.** Adjacent body text: "… omitting RoPE rescaling hinders the ability to adapt to positional encodings beyond the pretraining range, resulting in semantic drift or severe local repetition. These indicate that naive UHR scaling is insufficient and that dedicated post-adaptation is necessary for high-quality ultra-high-resolution image editing. More detailed ablation studies are provided in the supplementary material. Effect of …"

**Figure 10.** Loss curves of models trained without rescaling at…

**Figure 11.** Distribution of editing types in VINS-120K across dif…

**Figure 12.** Qualitative comparison on multi-turn editing evaluation.

**Figure 13.** Qualitative comparisons on out-of-domain editing eval…

**Figure 14.** Qualitative results on different base model: QwenImage-Edit-2511.

**Figure 15.** Editing failure examples. Adjacent body text: "G. Limitations and Future Work. Our method still has certain limitations in text editing [12]. As shown in…"

**Figure 16.** Spectral density analysis of the generated images.

**Figure 17.** Comparison of image statistics between AnyEdit [… 

**Figure 18.** Examples of Editing Instruction Annotation performed by our pipeline.

**Figure 19.** More high-quality examples from VINS-120K Dataset.

**Figure 20.** More qualitative comparison (1/2) between our method and recent baselines (Seedream4.0 [… 

**Figure 21.** More qualitative comparison (2/2) between our method and recent baselines (Seedream4.0 [… 
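The text attached to Figure 4 mentions the Tenengrad gradient magnitude [22] as a structural-clarity measure. As a rough illustration, Tenengrad can be computed as the mean squared Sobel gradient magnitude of a grayscale image. The version below is a common textbook variant written with NumPy slices; whether the paper thresholds the gradient or normalizes it differently is not stated here, so treat those details as assumptions. The `box_blur` helper exists only to fabricate a soft test image.

```python
import numpy as np


def tenengrad(gray):
    """Mean squared Sobel gradient magnitude over the interior of a 2-D grayscale array."""
    g = np.asarray(gray, dtype=np.float64)
    # Horizontal Sobel response: right column minus left column, centre row weighted by 2.
    gx = (g[:-2, 2:] + 2 * g[1:-1, 2:] + g[2:, 2:]) \
        - (g[:-2, :-2] + 2 * g[1:-1, :-2] + g[2:, :-2])
    # Vertical Sobel response: bottom row minus top row, centre column weighted by 2.
    gy = (g[2:, :-2] + 2 * g[2:, 1:-1] + g[2:, 2:]) \
        - (g[:-2, :-2] + 2 * g[:-2, 1:-1] + g[:-2, 2:])
    return float(np.mean(gx ** 2 + gy ** 2))


def box_blur(gray):
    """3x3 mean filter over interior pixels, used here only to fake a soft image."""
    g = np.asarray(gray, dtype=np.float64)
    h, w = g.shape
    return sum(g[i:i + h - 2, j:j + w - 2] for i in range(3) for j in range(3)) / 9.0


rng = np.random.default_rng(0)
sharp = rng.random((64, 64))   # synthetic high-detail image
soft = box_blur(sharp)         # same content with fine detail averaged away

sharp_score = tenengrad(sharp)
soft_score = tenengrad(soft)
```

A filter built on this score would keep images above some cutoff; that cutoff would have to be tuned per resolution, since gradient statistics shift as images are scaled.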

Abstract

Directly editing ultra-high-resolution (UHR) images is valuable but underexplored, primarily due to the lack of high-quality data and the challenge in modeling high-frequency texture details. We introduce VINS-120K, the first large-scale dataset for instruction-based UHR image editing, comprising 120K carefully curated triplets of instruction, input image, and edited image. Each image exceeds 4K resolution ($\geq$4096 $\times$ 4096) and is filtered through a rigorous multi-stage pipeline to ensure visual quality, instruction alignment, and aesthetic fidelity. Built on VINS-120K, we further develop a high-frequency-aware post-adaptation strategy to extend pretrained non-high-resolution models to the UHR regime. We also present VINS-4KEval, a benchmark covering diverse editing types, to facilitate consistent evaluation in UHR settings. Experiments confirm that our work improves fine-grained detail synthesis and texture realism in UHR image editing.
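The abstract's emphasis on high-frequency texture detail can be made measurable. One simple proxy, in the spirit of the spectral density analysis listed as Figure 16, is the share of an image's spectral power that lies above a radial frequency cutoff. The sketch below is an assumption-laden illustration: the cutoff of 0.25 cycles per pixel and the function name are arbitrary choices, not values from the paper.

```python
import numpy as np


def high_freq_ratio(gray, cutoff=0.25):
    """Fraction of spectral power at radial frequency above `cutoff` (cycles per pixel)."""
    g = np.asarray(gray, dtype=np.float64)
    g = g - g.mean()  # remove the DC term so overall brightness does not dominate
    power = np.abs(np.fft.fft2(g)) ** 2
    fy = np.fft.fftfreq(g.shape[0])[:, None]
    fx = np.fft.fftfreq(g.shape[1])[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    total = power.sum()
    if total == 0:
        return 0.0
    return float(power[radius > cutoff].sum() / total)


x = np.arange(64)
# Synthetic stripes: 2 cycles across the image (coarse) versus 24 cycles (fine).
coarse = np.tile(np.sin(2 * np.pi * 2 * x / 64), (64, 1))
fine = np.tile(np.sin(2 * np.pi * 24 * x / 64), (64, 1))

coarse_ratio = high_freq_ratio(coarse)
fine_ratio = high_freq_ratio(fine)
```

Comparing this ratio between an edited output and its input gives a quick check on whether an editing model has smoothed away texture, although it says nothing about whether the retained detail is plausible.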

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

VINS-120K supplies a new 120K-scale dataset for instruction-based 4K+ image editing, but the abstract reports zero metrics on data quality or results.


The paper's core offering is VINS-120K, a collection of 120K instruction-input-edit triplets where every image is at least 4096x4096. It also includes a new benchmark, VINS-4KEval, and a post-adaptation step meant to push existing models into the high-resolution regime. This scale and resolution combination has not appeared before for this exact task, so the resource itself is the main novelty on the page.

The curation description is concrete: a multi-stage pipeline that filters for visual quality, instruction match, and aesthetics. That level of detail on construction is useful for anyone who might want to replicate or extend the effort. The abstract positions the work as directly addressing the data shortage that has kept ultra-high-res editing underdeveloped. That framing is accurate and points to a real practical gap in media and design applications.

The soft spot is straightforward. The abstract asserts that the triplets are high-quality and instruction-aligned and that the adaptation improves fine-grained detail and texture realism, yet it contains no human preference scores, no alignment accuracy numbers, no artifact counts, and no baseline or ablation tables. Without those, the quality claim rests entirely on the pipeline description. If the full paper supplies the missing quantitative checks and shows they hold up, the dataset becomes a usable contribution. If not, downstream claims about better synthesis cannot be isolated from possible curation artifacts.

This is the kind of paper that matters to groups building or fine-tuning generative editing models who need high-resolution training pairs. A reader who already works in that area would get value from the scale and the benchmark once the validation numbers are in place. The thinking is coherent on its own terms and engages the literature gap honestly.

A serious editor should send it to peer review so the full experiments and data checks can be examined; the resource angle alone justifies the time even if revisions are needed on the evaluation side.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces VINS-120K, the first large-scale dataset of 120K instruction-based triplets (instruction, input image, edited image) for ultra-high-resolution (UHR) image editing where every image exceeds 4K resolution. It describes a multi-stage curation pipeline to ensure visual quality, instruction alignment, and aesthetic fidelity; proposes a high-frequency-aware post-adaptation strategy to extend pretrained models to the UHR regime; and presents the VINS-4KEval benchmark. Experiments are claimed to confirm gains in fine-grained detail synthesis and texture realism.

Significance. If the curation pipeline produces verifiably high-quality, instruction-aligned triplets and the reported gains are reproducible against baselines, the dataset and adaptation method could meaningfully advance UHR editing research by addressing the current scarcity of suitable training data and evaluation protocols.

major comments (2)

[Abstract] The central claim that the dataset and post-adaptation strategy improve fine-grained detail synthesis and texture realism is asserted without any reported metrics, baselines, ablation studies, or evaluation protocol, rendering the claim impossible to assess.
[Dataset construction paragraph] The multi-stage curation pipeline is presented as producing high-quality, instruction-aligned triplets free of systematic artifacts, yet no quantitative validation (human preference scores, alignment accuracy, or artifact statistics) is supplied; this validation is load-bearing for isolating claimed texture-realism gains from data artifacts.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address each major comment below. Where the comments identify gaps in the current manuscript, we agree that revisions are warranted and will incorporate the suggested additions in the revised version.


Referee: [Abstract] The central claim that the dataset and post-adaptation strategy improve fine-grained detail synthesis and texture realism is asserted without any reported metrics, baselines, ablation studies, or evaluation protocol, rendering the claim impossible to assess.

Authors: We agree that the abstract would benefit from explicit quantitative support. The full manuscript contains an Experiments section that reports results on VINS-4KEval, including comparisons against baselines and ablations demonstrating gains in detail synthesis and texture realism. To address the concern directly, we will revise the abstract to include a concise summary of the key metrics (e.g., improvements in perceptual quality and alignment scores) and reference the evaluation protocol.

Revision: yes

Referee: [Dataset construction paragraph] The multi-stage curation pipeline is presented as producing high-quality, instruction-aligned triplets free of systematic artifacts, yet no quantitative validation (human preference scores, alignment accuracy, or artifact statistics) is supplied; this validation is load-bearing for isolating claimed texture-realism gains from data artifacts.

Authors: The current manuscript describes the pipeline stages but does not include quantitative validation of the curation outcomes. We concur that such validation is important to substantiate the quality claims and to separate data effects from the adaptation method. We will add a dedicated paragraph or table in the revised manuscript reporting human preference studies (e.g., alignment accuracy and artifact rates) and any available automated statistics from the filtering stages.

Revision: yes
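The alignment accuracy and artifact rates that the authors promise to report are binomial proportions, so an interval alongside each point estimate would make them easier to interpret. A minimal sketch using the Wilson score interval follows; the audit size and the ratings are invented for illustration and are not results from the paper.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 gives roughly 95%)."""
    if n == 0:
        raise ValueError("need at least one rating")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Hypothetical audit: 1 means the rater judged the edit to match the instruction.
ratings = [1] * 182 + [0] * 18
accuracy = sum(ratings) / len(ratings)
low, high = wilson_interval(sum(ratings), len(ratings))
```

The Wilson interval behaves better than the plain normal approximation when the proportion is close to 0 or 1, which is the expected regime for artifact rates in a heavily filtered dataset.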

Circularity Check

0 steps flagged

No circularity: empirical dataset contribution with no derivations or self-referential predictions


The paper presents VINS-120K as a curated dataset of 120K triplets and a high-frequency-aware post-adaptation strategy, supported by experiments on VINS-4KEval. No equations, fitted parameters, predictions, or uniqueness theorems appear in the provided text. The multi-stage curation pipeline is described as an input process rather than a derived result, and downstream claims rest on empirical outcomes rather than any reduction to self-defined quantities or self-citations. This matches the default case of a self-contained empirical contribution with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Contribution rests on the empirical claim that a filtered 120K triplet collection plus a high-frequency adaptation step yields measurable gains; no mathematical free parameters, new physical entities, or formal axioms are introduced.

axioms (1)

Domain assumption: A multi-stage filtering pipeline can reliably produce instruction-aligned, high-aesthetic UHR editing triplets without introducing curation artifacts.
Invoked in the dataset construction paragraph of the abstract.



Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · 14 internal anchors

[1] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report...
[2] Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18392–18402, 2023.
[3] Mingdeng Cao, Xintao Wang, Zhongang Qi, Ying Shan, Xiaohu Qie, and Yinqiang Zheng. Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing. In Proceedings of the IEEE/CVF international conference on computer vision, pages 22560–22570, 2023.
[4] Shuo Cao, Nan Ma, Jiayang Li, Xiaohui Li, Lihao Shao, Kaiwen Zhu, Yu Zhou, Yuandong Pu, Jiarui Wu, Jiaquan Wang, et al. Artimuse: Fine-grained image aesthetics assessment with joint scoring and expert-level understanding. arXiv preprint arXiv:2507.14533, 2025.
[5] Brandon Castellano. PySceneDetect. https://www.scenedetect.com. Video Cut Detection and Analysis Tool. Available at: https://github.com/Breakthrough/PySceneDetect. BSD-3-Clause License.
[6] Junyang Chen, Jinshan Pan, and Jiangxin Dong. Faithdiff: Unleashing diffusion priors for faithful image super-resolution. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 28188–28197, 2025.
[7] Zhennan Chen, Yajie Li, Haofan Wang, Zhibo Chen, Zhengkai Jiang, Jun Li, Qian Wang, Jian Yang, and Ying Tai. Ragd: Regional-aware diffusion model for text-to-image generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19331–19341, 2025.
[8] Zhennan Chen, Junwei Zhu, Xu Chen, Jiangning Zhang, Xiaobin Hu, Hanzhen Zhao, Chengjie Wang, Jian Yang, and Ying Tai. Dip: Taming diffusion models in pixel space. arXiv preprint arXiv:2511.18822, 2025.
[9] En Ci, Shanyan Guan, Yanhao Ge, Yilin Zhang, Wei Li, Zhenyu Zhang, Jian Yang, and Ying Tai. Describe, don’t dictate: Semantic image editing with natural language intent. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19185–19194, 2025.
[10] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.
[11] Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, and Haoqi Fan. Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683, 2025.
[12] Nikai Du, Zhennan Chen, Shan Gao, Zhizhou Chen, Xi Chen, Zhengkai Jiang, Jian Yang, and Ying Tai. Textcrafter: Accurately rendering multiple texts in complex visual scenes. arXiv preprint arXiv:2503.23461, 2025.
[13] Ruoyi Du, Dongyang Liu, Le Zhuo, Qin Qi, Hongsheng Li, Zhanyu Ma, and Peng Gao. I-max: Maximize the resolution potential of pre-trained rectified flow transformers with projected flow. 2024.
[14] Tsu-Jui Fu, Wenze Hu, Xianzhi Du, William Yang Wang, Yinfei Yang, and Zhe Gan. Guiding instruction-based image editing via multimodal large language models. arXiv preprint arXiv:2309.17102, 2023.
[15] Yuying Ge, Sijie Zhao, Chen Li, Yixiao Ge, and Ying Shan. Seed-data-edit technical report: A hybrid dataset for instructional image editing. arXiv preprint arXiv:2405.04007,
[16] Robert M Haralick, Karthikeyan Shanmugam, and Its’Hak Dinstein. Textural features for image classification. IEEE Transactions on systems, man, and cybernetics, (6):610–621,
[17] Sultan Hassan, Francisco Villaescusa-Navarro, Benjamin Wandelt, David N Spergel, Daniel Anglés-Alcázar, Shy Genel, Miles Cranmer, Greg L Bryan, Romeel Davé, Rachel S Somerville, et al. Hiflow: Generating diverse hi maps and inferring cosmology while marginalizing over astrophysics using normalizing flows. The Astrophysical Journal, 937(2):83, 2022.
[18] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.
[19] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022.
[20] Yuzhou Huang, Liangbin Xie, Xintao Wang, Ziyang Yuan, Xiaodong Cun, Yixiao Ge, Jiantao Zhou, Chao Dong, Rui Huang, Ruimao Zhang, et al. Smartedit: Exploring complex instruction-based image editing with multimodal large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8362–8371, 2024.
[21] Mude Hui, Siwei Yang, Bingchen Zhao, Yichun Shi, Heng Wang, Peng Wang, Yuyin Zhou, and Cihang Xie. Hq-edit: A high-quality dataset for instruction-based image editing. arXiv preprint arXiv:2404.09990, 2024.
[22] Eric Krotkov. Focusing. International Journal of Computer Vision, 1(3):223–237, 1988.
[23] Max Ku, Dongfu Jiang, Cong Wei, Xiang Yue, and Wenhu Chen. Viescore: Towards explainable metrics for conditional image synthesis evaluation. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12268–12290, 2024.
[24] Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2024.
[25] Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, et al. Flux.1 kontext: Flow matching for in-context image generation and editing in latent space. arXiv preprint arXiv:2506.15742,
[27] Zhuoying Li, Zhu Xu, Yuxin Peng, and Yang Liu. Balancing preservation and modification: A region and semantic aware metric for instruction-based image editing. arXiv preprint arXiv:2506.13827, 2025.
[28] Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.
[29] Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, et al. Step1x-edit: A practical framework for general image editing. arXiv preprint arXiv:2504.17761, 2025.
[30] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
[31] Jian Ma, Xujie Zhu, Zihao Pan, Qirong Peng, Xu Guo, Chen Chen, and Haonan Lu. X2edit: Revisiting arbitrary-instruction image editing through self-constructed data and task-aware representation learning. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 7764–7772, 2026.
[32] Kepan Nan, Rui Xie, Penghao Zhou, Tiehan Fan, Zhenheng Yang, Zhijie Chen, Xiang Li, Jian Yang, and Ying Tai. Openvid-1m: A large-scale high-quality dataset for text-to-video generation. arXiv preprint arXiv:2407.02371, 2024.
[33] Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. Yarn: Efficient context window extension of large language models. arXiv preprint arXiv:2309.00071, 2023.
[34] Haonan Qiu, Ning Yu, Ziqi Huang, Paul Debevec, and Ziwei Liu. Cinescale: Free lunch in high-resolution cinematic visual generation. arXiv preprint arXiv:2508.15774, 2025.
[35] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
[36] Jingjing Ren, Wenbo Li, Haoyu Chen, Renjing Pei, Bin Shao, Yong Guo, Long Peng, Fenglong Song, and Lei Zhu. Ultrapixel: Advancing ultra high-resolution image synthesis to new peaks. Advances in Neural Information Processing Systems, 37:111131–111171, 2024.
[37] Matthew Renze and Erhan Guven. Self-reflection in llm agents: Effects on problem-solving performance. arXiv preprint arXiv:2405.06682, 2024.
[38] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in neural information processing systems, 35:25278–25294, 2022.
[39] Team Seedream, Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, Zhiyao Guo, Xiaoxia Hou, Weilin Huang, Yixuan Huang, et al. Seedream 4.0: Toward next-generation multimodal image generation. arXiv preprint arXiv:2509.20427, 2025.
[40] Shelly Sheynin, Adam Polyak, Uriel Singer, Yuval Kirstain, Amit Zohar, Oron Ashual, Devi Parikh, and Yaniv Taigman. Emu edit: Precise image editing via recognition and generation tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8871–8879, 2024.
[41] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063,
[42]

Raft: Recurrent all-pairs field transforms for optical flow

Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. InEuropean conference on com- puter vision, pages 402–419. Springer, 2020. 3

work page 2020
[43]

Plug-and-play diffusion features for text-driven image-to-image translation

Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. Plug-and-play diffusion features for text-driven image-to-image translation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1921–1930, 2023. 2

work page 1921
[44]

Attention is all you need.Advances in neural information processing systems, 30, 2017

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko- reit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017. 5

[45] Xintao Wang, Liangbin Xie, Chao Dong, and Ying Shan. Real-esrgan: Training real-world blind super-resolution with pure synthetic data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1905–1914,

[46] Cong Wei, Zheyang Xiong, Weiming Ren, Xeron Du, Ge Zhang, and Wenhu Chen. Omniedit: Building image editing generalist models through specialist supervision. In The Thirteenth International Conference on Learning Representations, 2024.

[47] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.

[48] Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report. arXiv preprint arXiv:2508.02324, 2025.

[49] Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, et al. Omnigen2: Exploration to advanced multimodal generation. arXiv preprint arXiv:2506.18871, 2025.

[50] Zhucun Xue, Jiangning Zhang, Teng Hu, Haoyang He, Yinan Chen, Yuxuan Cai, Yabiao Wang, Chengjie Wang, Yong Liu, Xiangtai Li, et al. Ultravideo: High-quality uhd video dataset with comprehensive captions. arXiv preprint arXiv:2506.13691, 2025.

[51] Junyan Ye, Dongzhi Jiang, Zihao Wang, Leqi Zhu, Zhenghao Hu, Zilong Huang, Jun He, Zhiyuan Yan, Jinghua Yu, Hongsheng Li, et al. Echo-4o: Harnessing the power of gpt-4o synthetic images for improved image generation. arXiv preprint arXiv:2508.09987, 2025.

[52] Yang Ye, Xianyi He, Zongjian Li, Bin Lin, Shenghai Yuan, Zhiyuan Yan, Bohan Hou, and Li Yuan. Imgedit: A unified image editing dataset and benchmark. arXiv preprint arXiv:2505.20275, 2025.

[53] Qifan Yu, Wei Chow, Zhongqi Yue, Kaihang Pan, Yang Wu, Xiaoyang Wan, Juncheng Li, Siliang Tang, Hanwang Zhang, and Yueting Zhuang. Anyedit: Mastering unified high-quality image editing for any idea. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26125–26135, 2025.

[54] Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. Big bird: Transformers for longer sequences. Advances in Neural Information Processing Systems, 33:17283–17297, 2020.

[55] Jinjin Zhang, Qiuyu Huang, Junjie Liu, Xiefan Guo, and Di Huang. Diffusion-4k: Ultra-high-resolution image synthesis with latent diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 23464–23473, 2025.

[56] Kai Zhang, Jingyun Liang, Luc Van Gool, and Radu Timofte. Designing a practical degradation model for deep blind image super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4791–4800,

[57] Zechuan Zhang, Ji Xie, Yu Lu, Zongxin Yang, and Yi Yang. Enabling instructional image editing with in-context generation in large scale diffusion transformer. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.

[58] Chen Zhao, En Ci, Yunzhe Xu, Tiehan Fan, Shanyan Guan, Yanhao Ge, Jian Yang, and Ying Tai. Ultrahr-100k: Enhancing uhr image synthesis with a large-scale high-quality dataset. arXiv preprint arXiv:2510.20661, 2025.

[59] Haozhe Zhao, Xiaojian Shawn Ma, Liang Chen, Shuzheng Si, Rujie Wu, Kaikai An, Peiyu Yu, Minjia Zhang, Qing Li, and Baobao Chang. Ultraedit: Instruction-based fine-grained image editing at scale. Advances in Neural Information Processing Systems, 37:3058–3093, 2024.
