GenClaw: Code-Driven Agentic Image Generation

Dongzhi Jiang; Jun He; Junyan Ye; Rui Chen; Weijia Li; Xuan Yang; Zilong Huang

arxiv: 2605.30248 · v2 · pith:MWJEYFE6new · submitted 2026-05-28 · 💻 cs.CV

Junyan Ye, Jun He, Zilong Huang, Dongzhi Jiang, Xuan Yang, Rui Chen, Weijia Li

Pith reviewed 2026-06-29 07:35 UTC · model grok-4.3

keywords agentic image generation · code-driven generation · controllable visual synthesis · SVG code rendering · multimodal agents · staged image creation · interpretable generation

The pith

GenClaw turns AI image generation into a staged process by inserting executable code sketches between reasoning and pixel synthesis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GenClaw as a workflow in which an agent first builds conceptual context through search and reasoning, then writes code such as SVG, HTML, or ThreeJS to produce an executable sketch, and finally applies an image model to add textures and realism. This inserts code as a controllable intermediate layer that lets the agent directly manipulate structure instead of cycling through prompt revisions alone. The approach aims to make generation more like human creation, where planning, sketching, and coloring occur in sequence. A sympathetic reader would care because it addresses the lack of direct canvas control in current multimodal agents that depend entirely on black-box image models.

Core claim

GenClaw is a code-driven agentic image generation paradigm that empowers the agent to create like a human artist: first conceptualizing, then sketching, and finally coloring. Specifically, the agent first constructs the conceptual knowledge and context through search and reasoning. It then utilizes code (e.g., SVG, HTML, ThreeJS) to render executable visual sketches. Finally, it employs an image generation model to supplement textures, materials, and photorealism. In this workflow, code serves as a controllable intermediate canvas bridging linguistic reasoning and pixel synthesis.
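The three stages above can be sketched as plain functions. This is a minimal illustration, not the authors' implementation: the `Shape` planning format, the function names, and the hard-coded plan are all assumptions made for the example, and the image model in the final stage is never called; the code only packages what such a model might receive.

```python
from dataclasses import dataclass


@dataclass
class Shape:
    """One planned element of the sketch (hypothetical planning format)."""
    element_id: str
    kind: str  # "rect" or "circle"
    x: int     # top-left x for rect, centre x for circle
    y: int     # top-left y for rect, centre y for circle
    size: int  # side length for rect, radius for circle
    fill: str


def plan_concept(request: str) -> list[Shape]:
    """Stage 1 stand-in: a real agent would search and reason here."""
    # Hard-coded plan so the example stays self-contained.
    return [
        Shape("sky", "rect", 0, 0, 200, "#87ceeb"),
        Shape("sun", "circle", 150, 50, 30, "#ffd700"),
    ]


def render_sketch(shapes: list[Shape], width: int = 200, height: int = 200) -> str:
    """Stage 2: emit an SVG document that serves as the editable code canvas."""
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">'
    ]
    for s in shapes:
        if s.kind == "rect":
            parts.append(
                f'<rect id="{s.element_id}" x="{s.x}" y="{s.y}" '
                f'width="{s.size}" height="{s.size}" fill="{s.fill}"/>'
            )
        elif s.kind == "circle":
            parts.append(
                f'<circle id="{s.element_id}" cx="{s.x}" cy="{s.y}" '
                f'r="{s.size}" fill="{s.fill}"/>'
            )
        else:
            raise ValueError(f"unsupported shape kind: {s.kind!r}")
    parts.append("</svg>")
    return "\n".join(parts)


def build_refinement_request(svg: str, request: str) -> dict:
    """Stage 3 stand-in: package the sketch for an image model (not called here)."""
    return {
        "sketch_svg": svg,
        "prompt": f"Add realistic textures and lighting to: {request}",
    }


sketch = render_sketch(plan_concept("a sun in a blue sky"))
payload = build_refinement_request(sketch, "a sun in a blue sky")
```

The point of the middle stage is that `sketch` is ordinary text: structure can be inspected or changed there before any pixels are synthesized.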

What carries the argument

Code as a controllable intermediate canvas that bridges the agent's linguistic reasoning and downstream pixel synthesis in a three-stage workflow.

If this is right

Agents gain direct editability over visual structure by modifying the code sketch rather than relying solely on repeated prompt adjustments.
The generation process gains interpretability because the code layer exposes the agent's reasoning in a readable and revisable form.
Programmatic control over layout and elements can be combined with the strengths of generative models for photorealistic output.
The workflow reduces dependence on black-box refinement loops by providing an explicit intermediate representation.
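To make the editability point concrete, here is a small sketch of a structural edit applied to an SVG canvas with Python's standard `xml.etree.ElementTree` module. The sample sketch and the `move_circle` helper are hypothetical; the paper does not describe how its agent edits code.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
ET.register_namespace("", SVG_NS)  # serialize without ns0: prefixes

# Hypothetical code sketch, as an agent might have emitted it.
SKETCH = (
    '<svg xmlns="http://www.w3.org/2000/svg" width="200" height="200">'
    '<rect id="sky" x="0" y="0" width="200" height="200" fill="#87ceeb"/>'
    '<circle id="sun" cx="150" cy="50" r="30" fill="#ffd700"/>'
    "</svg>"
)


def move_circle(svg_text: str, element_id: str, cx: int, cy: int) -> str:
    """Return a copy of the sketch with one circle moved to (cx, cy)."""
    root = ET.fromstring(svg_text)
    for node in root.iter(f"{{{SVG_NS}}}circle"):
        if node.get("id") == element_id:
            node.set("cx", str(cx))
            node.set("cy", str(cy))
            return ET.tostring(root, encoding="unicode")
    raise KeyError(f"no circle with id {element_id!r}")


# Move the sun to the upper left; every other element is left untouched.
edited = move_circle(SKETCH, "sun", 40, 60)
```

A prompt-only loop would have to ask the image model to "move the sun to the left" and hope the rest of the scene survives; here the change is a two-attribute edit whose effect is known in advance.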

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If code-generation reliability increases, the same staged approach could apply to domains such as 3D scene creation or animation where intermediate representations already exist.
The separation of structure via code from appearance via image models could reduce the need to retrain large image generators for better structural fidelity.
Inspecting and editing the code canvas might offer a practical debugging path for generation failures that current prompt-only systems lack.

Load-bearing premise

Large language models can generate accurate, executable code that correctly captures the agent's intended visual concept, and this code combines cleanly with image models.

What would settle it

An experiment in which the code produced by the agent repeatedly fails to match the planned concept or introduces visual errors that the final image model cannot resolve.
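One hedged way to picture such an experiment: classify each emitted sketch against the plan it was meant to realize, then report the share that fails. The check below is a simplifying assumption (it only tests that the SVG parses and that every planned element id is present, not that the picture looks right), and the sample data is invented for illustration.

```python
import xml.etree.ElementTree as ET


def check_sketch(svg_text: str, expected_ids: set[str]) -> str:
    """Classify one sketch as 'syntax_error', 'missing_elements', or 'ok'."""
    try:
        root = ET.fromstring(svg_text)
    except ET.ParseError:
        return "syntax_error"
    found = {node.get("id") for node in root.iter() if node.get("id")}
    return "ok" if expected_ids <= found else "missing_elements"


def failure_rate(samples: list[tuple[str, set[str]]]) -> float:
    """Fraction of (sketch, planned ids) pairs that are not classified 'ok'."""
    outcomes = [check_sketch(svg, ids) for svg, ids in samples]
    return sum(outcome != "ok" for outcome in outcomes) / len(outcomes)


# Invented samples: one faithful sketch, one missing an element, one malformed.
NS = 'xmlns="http://www.w3.org/2000/svg"'
SAMPLES = [
    (f'<svg {NS}><rect id="sky"/><circle id="sun"/></svg>', {"sky", "sun"}),
    (f'<svg {NS}><rect id="sky"/></svg>', {"sky", "sun"}),
    (f'<svg {NS}><circle id="sun"></svg>', {"sun"}),
]

rate = failure_rate(SAMPLES)
```

A persistently high rate on realistic prompts, with errors that survive the image-model stage, would count against the premise above.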

Original abstract

Image generation models have evolved from text-conditioned pixel synthesis toward multimodal agents endowed with visual comprehension and tool invocation capabilities. Yet, existing agents remain at the mercy of underlying black-box image models. Their workflow is trapped in a repetitive cycle of prompt rewriting for generation refinement, leaving them with no mechanism to directly manipulate the canvas. In essence, the potential of LLMs to serve as a genuine "brush" for precise visual construction remains largely untapped. In this paper, we propose GenClaw, a code-driven agentic image generation paradigm that empowers the agent to create like a human artist: first conceptualizing, then sketching, and finally coloring. Specifically, the agent first constructs the conceptual knowledge and context through search and reasoning. It then utilizes code (e.g., SVG, HTML, ThreeJS) to render executable visual sketches. Finally, it employs an image generation model to supplement textures, materials, and photorealism. In this workflow, code serves as a controllable intermediate canvas bridging linguistic reasoning and pixel synthesis, seamlessly integrating programmatic logic with the visual expressiveness of generative models. By transforming image generation from a black-box paradigm into a staged process akin to authentic human creation, GenClaw offers a step toward for highly controllable and interpretable visual generation systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

GenClaw is a clean conceptual proposal for inserting code sketches as a controllable step in agentic image generation, but it contains no experiments or validation.

GenClaw proposes a three-stage workflow for agentic image generation: the agent first searches and reasons to build context, then emits code (SVG, HTML, ThreeJS) to produce an executable sketch, and finally passes that to an image model for textures and realism. Code acts as the editable intermediate canvas.

The paper does a solid job naming the core limitation of current systems (the endless cycle of prompt rewriting against black-box models) and sketching a more structured alternative that mirrors how people work: conceptualize first, then refine. The framing of code as a bridge between linguistic reasoning and pixel output is a reasonable architectural suggestion.

The obvious gap is the total lack of any implementation, results, or analysis. No prototype is described, no tests check whether LLMs actually emit correct and faithful code, and there is no comparison showing improved control or fewer artifacts versus direct prompting. The central assumption that the code step will deliver better interpretability without new failure modes is left unexamined.

This work is for people designing multimodal agents who want architectural ideas rather than plug-and-play methods. A reader hunting for quantitative gains or reproducible pipelines will come up empty.

It should go to peer review. The problem statement is clear and the proposed direction is distinct enough from pure text-agent baselines that referees can usefully comment on whether the idea merits follow-up experiments.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes GenClaw, a code-driven agentic image generation paradigm in which an LLM agent first builds conceptual knowledge via search and reasoning, then renders an executable visual sketch using code (SVG, HTML, or ThreeJS), and finally applies an image generation model to add textures, materials, and photorealism. Code is positioned as a controllable intermediate canvas that bridges linguistic reasoning and pixel synthesis, transforming black-box generation into a staged, human-like process for greater controllability and interpretability.

Significance. If the workflow can be shown to work reliably, the staged code-mediated approach could meaningfully improve controllability and interpretability in agentic image generation by allowing direct programmatic manipulation of visual structure before photorealistic refinement. This would represent a conceptual advance over purely prompt-based black-box agents.

major comments (3)

Abstract: The central claim that GenClaw 'empowers the agent to create like a human artist' and offers 'a step toward highly controllable and interpretable visual generation systems' is unsupported because the manuscript contains no experiments, ablation studies, user evaluations, quantitative metrics, or even implementation details demonstrating that the proposed workflow achieves these benefits.
Abstract: The proposal rests on the untested assumption that current LLMs can reliably emit correct, intent-preserving executable code (SVG/HTML/ThreeJS) that accurately captures conceptual reasoning; no failure-mode analysis, error rates, or comparison against black-box baselines is provided to substantiate this.
Abstract: No evidence or discussion is given on whether the code-to-image handoff preserves control or introduces new artifacts, which is load-bearing for the claim that code serves as a 'seamlessly integrating' controllable intermediate representation.

minor comments (1)

Abstract: The sentence 'offers a step toward for highly controllable' contains a grammatical error and should be corrected.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. The work presents a conceptual proposal for a code-mediated workflow rather than an empirical study. We will revise the abstract and add a dedicated limitations and future work section to ensure claims are appropriately scoped and to outline evaluation directions.

Referee: Abstract: The central claim that GenClaw 'empowers the agent to create like a human artist' and offers 'a step toward highly controllable and interpretable visual generation systems' is unsupported because the manuscript contains no experiments, ablation studies, user evaluations, quantitative metrics, or even implementation details demonstrating that the proposed workflow achieves these benefits.

Authors: We agree that the current abstract language overstates demonstrated outcomes. The manuscript introduces the staged workflow as a conceptual paradigm; the quoted phrases describe intended properties of the approach. In revision we will replace these with more measured wording (e.g., 'aims to empower' and 'potentially offers a step toward') and add an explicit statement that empirical validation remains future work.
revision: yes

Referee: Abstract: The proposal rests on the untested assumption that current LLMs can reliably emit correct, intent-preserving executable code (SVG/HTML/ThreeJS) that accurately captures conceptual reasoning; no failure-mode analysis, error rates, or comparison against black-box baselines is provided to substantiate this.

Authors: The manuscript does rely on the premise that LLMs can produce usable code sketches, drawing from observed capabilities in related literature, but provides no dedicated analysis of failure modes. We will add a new subsection under Limitations that enumerates known risks (syntax errors, semantic drift, style mismatch) and sketches how future controlled studies could quantify them against direct image-generation baselines.
revision: yes

Referee: Abstract: No evidence or discussion is given on whether the code-to-image handoff preserves control or introduces new artifacts, which is load-bearing for the claim that code serves as a 'seamlessly integrating' controllable intermediate representation.

Authors: We concur that the handoff step is central and currently undiscussed. Revision will include a short analysis of the transition, noting that the image model receives both the rendered sketch and a textual prompt derived from the same reasoning trace, and will flag potential artifacts (e.g., loss of precise geometry, texture hallucination). We will also qualify the term 'seamlessly integrating' to 'intended to integrate'.
revision: yes
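As an editorial illustration of how "loss of precise geometry" at the handoff might be quantified, the sketch below compares each element's planned bounding box (read off the code sketch) with the box observed in the final image, using intersection-over-union. Nothing here comes from the manuscript: the `Box` type, the 0.5 threshold, and the idea that observed boxes come from some object detector are all assumptions.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned box in pixel coordinates: (x0, y0) top-left, (x1, y1) bottom-right."""
    x0: float
    y0: float
    x1: float
    y1: float


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two boxes; 0.0 when they do not overlap."""
    inter_w = max(0.0, min(a.x1, b.x1) - max(a.x0, b.x0))
    inter_h = max(0.0, min(a.y1, b.y1) - max(a.y0, b.y0))
    inter = inter_w * inter_h
    area_a = (a.x1 - a.x0) * (a.y1 - a.y0)
    area_b = (b.x1 - b.x0) * (b.y1 - b.y0)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def drifted(
    planned: dict[str, Box], observed: dict[str, Box], threshold: float = 0.5
) -> list[str]:
    """Ids whose observed box is missing or overlaps the plan by less than threshold."""
    return sorted(
        name
        for name, box in planned.items()
        if name not in observed or iou(box, observed[name]) < threshold
    )


# Invented example: the sky stays put, the sun ends up on the wrong side.
planned = {"sky": Box(0, 0, 200, 200), "sun": Box(120, 20, 180, 80)}
observed = {"sky": Box(0, 0, 200, 200), "sun": Box(20, 20, 80, 80)}
moved = drifted(planned, observed)
```

Texture hallucination would need a different measure; this covers only the geometric half of the concern.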

Circularity Check

0 steps flagged

No circularity: conceptual paradigm proposal without derivations or self-referential reductions

full rationale

The manuscript is a forward-looking proposal for a code-driven agentic workflow (conceptualization → code sketch → image refinement) with no equations, fitted parameters, predictions, or load-bearing self-citations. No step reduces by construction to its own inputs, as there are no quantitative claims, uniqueness theorems, or ansatzes to inspect. The central claim remains a descriptive suggestion rather than a derived result.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the untested domain assumption that LLMs can produce reliable executable visual code and that this code meaningfully improves controllability over direct pixel synthesis.

axioms (1)

domain assumption: LLMs can reliably generate correct executable code (SVG, HTML, ThreeJS) that captures conceptual knowledge for visual sketches.
The workflow in the abstract depends on this capability for the sketching stage to function as a controllable intermediate.



Reference graph

Works this paper leans on

66 extracted references · 38 canonical work pages · 18 internal anchors

[1]

Claude.https://www.anthropic.com/claude, 2024

Anthropic. Claude.https://www.anthropic.com/claude, 2024. Accessed: 2026-05-07

2024
[2]

Flux2max: Nextgenerationimagesynthesis

BlackForestLabs. Flux2max: Nextgenerationimagesynthesis. https://bfl.ai/models/flux-2-max,
[3]

Accessed: 2026-01-26

2026
[4]

Flux 2 pro: State-of-the-art quality at maximum speed.https://bfl.ai/models/flux -2, 2026

Black Forest Labs. Flux 2 pro: State-of-the-art quality at maximum speed.https://bfl.ai/models/flux -2, 2026. Accessed: 2026-01-26

2026
[5]

FLUX.2 [klein]: Towards Interactive Visual Intelligence.https://bfl.ai/blog/flux 2-klein-towards-interactive-visual-intelligence, 2026

Black Forest Labs. FLUX.2 [klein]: Towards Interactive Visual Intelligence.https://bfl.ai/blog/flux 2-klein-towards-interactive-visual-intelligence, 2026. Accessed: 2026-05-07

2026
[6]

Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

Huanqia Cai, Sihan Cao, Ruoyi Du, Peng Gao, Steven Hoi, Zhaohui Hou, Shijie Huang, Dengyang Jiang, Xin Jin, Liangchen Li, et al. Z-image: An efficient image generation foundation model with single-stream diffusion transformer.arXiv preprint arXiv:2511.22699, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, et al. Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset. arXiv preprint arXiv:2505.09568, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[8]

Unify-agent: A unified multimodal agent for world-grounded image synthesis.arXiv preprint arXiv:2603.29620, 2026

Shuang Chen, Quanxin Shou, Hangting Chen, Yucheng Zhou, Kaituo Feng, Wenbo Hu, Yi-Fan Zhang, Yunlong Lin, Wenxuan Huang, Mingyang Song, et al. Unify-agent: A unified multimodal agent for world-grounded image synthesis.arXiv preprint arXiv:2603.29620, 2026

work page arXiv 2026
[9]

Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[10]

Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024

2024
[11]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[12]

Yufeng Cui, Honghao Chen, Haoge Deng, Xu Huang, Xinghang Li, Jirong Liu, Yang Liu, Zhuoyan Luo, Jinsheng Wang, Wenxuan Wang, et al. Emu3. 5: Native multimodal models are world learners.arXiv preprint arXiv:2510.26583, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[13]

DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence

DeepSeek-AI. DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence. Technical Report, 2026

2026
[15]

Emerging Properties in Unified Multimodal Pretraining

ChaoruiDeng,DeyaoZhu,KunchangLi,ChenhuiGou,FengLi,ZeyuWang,ShuZhong,WeihaoYu,XiaonanNie, Ziang Song, et al. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[16]

Scaling rectified flow transformers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. InForty-first international conference on machine learning, 2024

2024
[17]

Gen-Searcher: Reinforcing Agentic Search for Image Generation

Kaituo Feng, Manyuan Zhang, Shuang Chen, Yunlong Lin, Kaixuan Fan, Yilei Jiang, Hongyu Li, Dian Zheng, Chenyang Wang, and Xiangyu Yue. Gen-searcher: Reinforcing agentic search for image generation.arXiv preprint arXiv:2603.28767, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[18]

X-omni: Reinforcement learning makes discrete autoregressive image generative models great again.arXiv preprint arXiv:2507.22058, 2025

Zigang Geng, Yibing Wang, Yeyao Ma, Chen Li, Yongming Rao, Shuyang Gu, Zhao Zhong, Qinglin Lu, Han Hu, Xiaosong Zhang, et al. X-omni: Reinforcement learning makes discrete autoregressive image generative models great again.arXiv preprint arXiv:2507.22058, 2025. 18

work page arXiv 2025
[19]

Generative adversarial networks.Communications of the ACM, 63(11):139–144, 2020

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks.Communications of the ACM, 63(11):139–144, 2020

2020
[20]

Gemini 2.0 flash.https://developers.googleblog.com/en/experiment-with-gem ini-20-flash-native-image-generation, 2025

Google. Gemini 2.0 flash.https://developers.googleblog.com/en/experiment-with-gem ini-20-flash-native-image-generation, 2025

2025
[21]

Gemini 3: Introducing the latest gemini ai model from google.https://blog.google/products /gemini/gemini-3/, 2025

Google. Gemini 3: Introducing the latest gemini ai model from google.https://blog.google/products /gemini/gemini-3/, 2025. Released November 18, 2025. Accessed: 2026-05-20

2025
[22]

Gemini image pro: High-quality image generation.https://deepmind.google/mode ls/gemini-image/pro/, 2025

Google DeepMind. Gemini image pro: High-quality image generation.https://deepmind.google/mode ls/gemini-image/pro/, 2025. Accessed: 2026-01-26

2025
[23]

Gemini image: High-quality image generation.https://deepmind.google/models /gemini-image/flash/, 2025

Google DeepMind. Gemini image: High-quality image generation.https://deepmind.google/models /gemini-image/flash/, 2025. Accessed: 2026-01-26

2025
[24]

Controlling your image via simplified vector graphics.arXiv preprint arXiv:2602.14443, 2026

Lanqing Guo, Xi Liu, Yufei Wang, Zhihao Li, and Siyu Huang. Controlling your image via simplified vector graphics.arXiv preprint arXiv:2602.14443, 2026

work page arXiv 2026
[25]

Mind-brush: Integrating agentic cognitive search and reasoning into image generation,

Jun He, Junyan Ye, Zilong Huang, Dongzhi Jiang, Chenjue Zhang, Leqi Zhu, Renrui Zhang, Xiang Zhang, and Weijia Li. Mind-brush: Integrating agentic cognitive search and reasoning into image generation. 2026. URL https://arxiv.org/abs/2602.01756

work page arXiv 2026
[26]

T2i-r1: Reinforcing image generation with collaborative semantic-level and token-level cot.arXiv preprint arXiv:2505.00703, 2025

Dongzhi Jiang, Ziyu Guo, Renrui Zhang, Zhuofan Zong, Hao Li, Le Zhuo, Shilin Yan, Pheng-Ann Heng, and Hongsheng Li. T2i-r1: Reinforcing image generation with collaborative semantic-level and token-level cot.arXiv preprint arXiv:2505.00703, 2025

work page arXiv 2025
[27]

Draco: Draft as cot for text-to-image preview and rare concept generation.arXiv preprint arXiv:2512.05112, 2025

Dongzhi Jiang, Renrui Zhang, Haodong Li, Zhuofan Zong, Ziyu Guo, Jun He, Claire Guo, Junyan Ye, Rongyao Fang, Weijia Li, et al. Draco: Draft as cot for text-to-image preview and rare concept generation.arXiv preprint arXiv:2512.05112, 2025

work page arXiv 2025
[28]

Genagent: Scaling text-to-image generation via agentic multimodal reasoning, 2026

Kaixun Jiang, Yuzheng Wang, Junjie Zhou, Pandeng Li, Zhihang Liu, Chen-Wei Xie, Zhaoyu Chen, Yun Zheng, and Wenqiang Zhang. Genagent: Scaling text-to-image generation via agentic multimodal reasoning, 2026. URL https://arxiv.org/abs/2601.18543

work page arXiv 2026
[29]

Segmentanything

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead,AlexanderCBerg,Wan-YenLo,etal. Segmentanything. InProceedingsoftheIEEE/CVFInternational Conference on Computer Vision, pages 4015–4026, 2023

2023
[30]

Think-then-generate: Reasoning-aware text-to-image diffusion with llm encoders.arXiv preprint arXiv:2601.10332, 2026

Siqi Kou, Jiachun Jin, Zetong Zhou, Ye Ma, Yugang Wang, Quan Chen, Peng Jiang, Xiao Yang, Jun Zhu, Kai Yu, et al. Think-then-generate: Reasoning-aware text-to-image diffusion with llm encoders.arXiv preprint arXiv:2601.10332, 2026

work page arXiv 2026
[31]

Flux.https://github.com/black-forest-labs/flux, 2024

Black Forest Labs. Flux.https://github.com/black-forest-labs/flux, 2024

2024
[32]

Flux.1 kontext: Flow matching for in-context image generation and editing in latent space

Black Forest Labs. Flux.1 kontext: Flow matching for in-context image generation and editing in latent space. 2025

2025
[33]

FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space

Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, Kyle Lacey, Yam Levi, Cheng Li, Dominik Lorenz, Jonas Müller, Dustin Podell, Robin Rombach, Harry Saini, Axel Sauer, and Luke Smith. Flux.1 kontext: Flow matching for in-context image ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[34]

Coco: Code as cot for text-to-image preview and rare concept generation

Haodong Li, Chunmei Qing, Huanyu Zhang, Dongzhi Jiang, Yihang Zou, Hongbo Peng, Dingming Li, Yuhong Dai, ZePeng Lin, Juanxi Tian, et al. Coco: Code as cot for text-to-image preview and rare concept generation. arXiv preprint arXiv:2603.08652, 2026

work page arXiv 2026
[35]

Crossviewdiff: Across-viewdiffusionmodelforsatellite-to-streetviewsynthesis.arXivpreprintarXiv:2408.14765, 2024

Weijia Li, Jun He, Junyan Ye, Huaping Zhong, Zhimeng Zheng, Zilong Huang, Dahua Lin, and Conghui He. Crossviewdiff: Across-viewdiffusionmodelforsatellite-to-streetviewsynthesis.arXivpreprintarXiv:2408.14765, 2024. 19

work page arXiv 2024
[36]

Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback

Zongjian Li, Zheyuan Liu, Qihui Zhang, Bin Lin, Feize Wu, Shenghai Yuan, Zhiyuan Yan, Yang Ye, Wangbo Yu, Yuwei Niu, Shaodong Wang, Xinhua Cheng, and Li Yuan. Uniworld-v2: Reinforce image editing with diffusion negative-awarefinetuningandmllmimplicitfeedback.2025.URL https://arxiv.org/abs/2510.16888

work page internal anchor Pith review Pith/arXiv arXiv 2025
[37]

An llm-lvlm driven agent for iterative and fine-grained image editing

Zihan Liang, Jiahao Sun, and Haoran Ma. An llm-lvlm driven agent for iterative and fine-grained image editing. arXiv preprint arXiv:2508.17435, 2025

work page arXiv 2025
[38]

Vcode: a multimodal coding benchmark with svg as symbolic visual representation.arXiv preprint arXiv:2511.02778, 2025

Kevin Qinghong Lin, Yuhao Zheng, Hangyu Ran, Dantong Zhu, Dongxing Mao, Linjie Li, Philip Torr, and Alex Jinpeng Wang. Vcode: a multimodal coding benchmark with svg as symbolic visual representation.arXiv preprint arXiv:2511.02778, 2025

work page arXiv 2025
[39]

Jarvisevo: Towards a self-evolving photo editing agent with synergistic editor-evaluator optimization.arXiv preprint arXiv:2511.23002, 2025

Yunlong Lin, Linqing Wang, Kunjie Lin, Zixu Lin, Kaixiong Gong, Wenbo Li, Bin Lin, Zhenxi Li, Shiyi Zhang, Yuyang Peng, et al. Jarvisevo: Towards a self-evolving photo editing agent with synergistic editor-evaluator optimization.arXiv preprint arXiv:2511.23002, 2025

work page arXiv 2025
[40]

Gpt-image-1: Models and capabilities for image generation.https://platform.openai.com/ docs/models/gpt-image-1, 2024

OpenAI. Gpt-image-1: Models and capabilities for image generation.https://platform.openai.com/ docs/models/gpt-image-1, 2024. Accessed: 2026-01-29

2024
[41]

Gpt-4o.https://openai.com/index/introducing-4o-image-generation, 2025

OpenAI. Gpt-4o.https://openai.com/index/introducing-4o-image-generation, 2025

2025
[42]

Gpt-image-1.5: Enhanced visual reasoning and creative generation.https://platform.openai

OpenAI. Gpt-image-1.5: Enhanced visual reasoning and creative generation.https://platform.openai. com/docs/models/gpt-image-1.5, 2025. Accessed: 2026-01-29

2025
[43]

GPT-Image-2

OpenAI. GPT-Image-2. https://developers.openai.com/api/docs/models/gpt-image-2 , 2026

2026
[44]

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis.arXiv preprint arXiv:2307.01952, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[45]

Lumina-image 2.0: A unified and efficient image generative framework.arXiv preprint arXiv:2503.21758, 2025

Qi Qin, Le Zhuo, Yi Xin, Ruoyi Du, Zhen Li, Bin Fu, Yiting Lu, Jiakang Yuan, Xinyue Li, Dongyang Liu, et al. Lumina-image 2.0: A unified and efficient image generative framework.arXiv preprint arXiv:2503.21758, 2025

work page arXiv 2025
[46]

Hierarchical Text-Conditional Image Generation with CLIP Latents

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents.arXiv preprint arXiv:2204.06125, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[47]

SCOPE: Structured Decomposition and Conditional Skill Orchestration for Complex Image Generation

Tianfei Ren, Zhipeng Yan, Yiming Zhao, Zhen Fang, Yu Zeng, Guohui Zhang, Hang Xu, Xiaoxiao Ma, Shiting Huang, Ke Xu, et al. Scope: Structured decomposition and conditional skill orchestration for complex image generation.arXiv preprint arXiv:2605.08043, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[48]

High-resolution image synthesiswithlatentdiffusionmodels

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesiswithlatentdiffusionmodels. InProceedingsoftheIEEE/CVFconferenceoncomputervisionandpattern recognition, pages 10684–10695, 2022

2022
[49]

Seedream 4.0: Toward Next-generation Multimodal Image Generation

TeamSeedream,YunpengChen,YuGao,LixueGong,MengGuo,QiushanGuo,ZhiyaoGuo,XiaoxiaHou,Weilin Huang, Yixuan Huang, Xiaowen Jian, Huafeng Kuang, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao,LiyangLiu,WeiLiu,YanzuoLu,ZhengxiongLuo,TongtongOu,GuangShi,YichunShi,ShiqiSun,YuTian, Zhi Tian, Peng Wang, Rui Wang, Xun Wang, Ye Wang, Guofeng Wu, Jie Wu, Wen...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[50]

Stable diffusion 3.5 large.https://huggingface.co/stabilityai/stable-diffusi on-3.5-large, 2024

Stability AI. Stable diffusion 3.5 large.https://huggingface.co/stabilityai/stable-diffusi on-3.5-large, 2024

2024
[51]

Kimi K2.5: Visual Agentic Intelligence

Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi k2.5: Visual agentic intelligence.arXiv preprint arXiv:2602.02276, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[52]

Longcat-next: Lexicalizing modalities as discrete tokens.arXiv preprint arXiv:2603.27538, 2026

Meituan LongCat Team, Bin Xiao, Chao Wang, Chengjiang Li, Chi Zhang, Chong Peng, Hang Yu, Hao Yang, Haonan Yan, Haoze Sun, et al. Longcat-next: Lexicalizing modalities as discrete tokens.arXiv preprint arXiv:2603.27538, 2026

work page arXiv 2026
[53]

Three.js: Javascript 3d library.https://threejs.org, 2024

Three.js Authors. Three.js: Javascript 3d library.https://threejs.org, 2024. Accessed: 2026-05-07. 20

2024
[54]

Internsvg: Towards unified svg tasks with multimodal large language models

Haomin Wang, Jinhui Yin, Qi Wei, Wenguang Zeng, Lixin Gu, Shenglong Ye, Zhangwei Gao, Yaohui Wang, Yanting Zhang, Yuanqi Li, et al. Internsvg: Towards unified svg tasks with multimodal large language models. arXiv preprint arXiv:2510.11341, 2025

work page arXiv 2025
[55]

Promptenhancer: A simple approach to enhance text-to-image models via chain-of-thought prompt rewriting.arXiv preprint arXiv:2509.04545, 2025

Linqing Wang, Ximing Xing, Yiji Cheng, Zhiyuan Zhao, Donghao Li, Tiankai Hang, Jiale Tao, Qixun Wang, Ruihuang Li, Comi Chen, et al. Promptenhancer: A simple approach to enhance text-to-image models via chain-of-thought prompt rewriting.arXiv preprint arXiv:2509.04545, 2025

work page arXiv 2025
[56]

Spot the fake: Large multimodal model-based synthetic image detection with artifact explanation.Advances in Neural Information Processing Systems, 38:58972–59005, 2026

Siwei Wen, Peilin Feng, Hengrui Kang, Zichen Wen, Yize Chen, Jiang Wu, Conghui He, Weijia Li, et al. Spot the fake: Large multimodal model-based synthetic image detection with artifact explanation.Advances in Neural Information Processing Systems, 38:58972–59005, 2026

2026
[57]

Qwen-Image Technical Report

Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report.arXiv preprint arXiv:2508.02324, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[58]

OmniGen2: Towards Instruction-Aligned Multimodal Generation

Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, et al. Omnigen2: Exploration to advanced multimodal generation.arXiv preprint arXiv:2506.18871, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[59] Yuhui Wu, Chenxi Xie, Ruibin Li, Liyi Chen, Qiaosi Yi, and Lei Zhang. Cocoedit: Content-consistent image editing via region regularized reinforcement learning. ArXiv, abs/2602.14068, 2026. doi: 10.48550/arXiv.2602.14068. URL https://arxiv.org/abs/2602.14068

[60] Yiying Yang, Wei Cheng, Sijin Chen, Xianfang Zeng, Fukun Yin, Jiaxu Zhang, Liao Wang, Gang Yu, Xingjun Ma, and Yu-Gang Jiang. Omnisvg: A unified scalable vector graphics generation model. Advances in Neural Information Processing Systems, 38:113670–113696, 2026

[61] Junyan Ye, Jun He, Weijia Li, Zhutao Lv, Yi Lin, Jinhua Yu, Haote Yang, and Conghui He. Leveraging bev paradigm for ground-to-aerial image synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 28451–28461, 2025

[62] Junyan Ye, Dongzhi Jiang, Zihao Wang, Leqi Zhu, Zhenghao Hu, Zilong Huang, Jun He, Zhiyuan Yan, Jinghua Yu, Hongsheng Li, et al. Echo-4o: Harnessing the power of gpt-4o synthetic images for improved image generation. arXiv preprint arXiv:2508.09987, 2025

[63] Junyan Ye, Baichuan Zhou, Zilong Huang, Junan Zhang, Tianyi Bai, Hengrui Kang, Jun He, Honglin Lin, Zihao Wang, Tong Wu, et al. Loki: A comprehensive synthetic data detection benchmark using large multimodal models. In International Conference on Learning Representations, volume 2025, pages 70440–70522, 2025
[64] Junyan Ye, Leiqi Zhu, Yuncheng Guo, Dongzhi Jiang, Zilong Huang, Yifan Zhang, Zhiyuan Yan, Haohuan Fu, Conghui He, and Weijia Li. Realgen: Photorealistic text-to-image generation via detector-guided rewards. arXiv preprint arXiv:2512.00473, 2025

[65] Yang Ye, Xianyi He, Zongjian Li, Bin Lin, Shenghai Yuan, Zhiyuan Yan, Bohan Hou, and Li Yuan. Imgedit: A unified image editing dataset and benchmark. arXiv preprint arXiv:2505.20275, 2025

[66] Shengming Yin, Zekai Zhang, Zecheng Tang, Kaiyuan Gao, Xiao Xu, Kun Yan, Jiahao Li, Yilei Chen, Yuxiang Chen, Heung-Yeung Shum, et al. Qwen-image-layered: Towards inherent editability via layer decomposition. arXiv preprint arXiv:2512.15603, 2025

[67] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF international conference on computer vision, pages 3836–3847, 2023

[1] Anthropic. Claude. https://www.anthropic.com/claude, 2024. Accessed: 2026-05-07

[2] Black Forest Labs. Flux 2 max: Next generation image synthesis. https://bfl.ai/models/flux-2-max, 2026. Accessed: 2026-01-26

[4] Black Forest Labs. Flux 2 pro: State-of-the-art quality at maximum speed. https://bfl.ai/models/flux-2, 2026. Accessed: 2026-01-26

[5] Black Forest Labs. FLUX.2 [klein]: Towards Interactive Visual Intelligence. https://bfl.ai/blog/flux2-klein-towards-interactive-visual-intelligence, 2026. Accessed: 2026-05-07

[6] Huanqia Cai, Sihan Cao, Ruoyi Du, Peng Gao, Steven Hoi, Zhaohui Hou, Shijie Huang, Dengyang Jiang, Xin Jin, Liangchen Li, et al. Z-image: An efficient image generation foundation model with single-stream diffusion transformer. arXiv preprint arXiv:2511.22699, 2025

[7] Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, et al. Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset. arXiv preprint arXiv:2505.09568, 2025

[8] Shuang Chen, Quanxin Shou, Hangting Chen, Yucheng Zhou, Kaituo Feng, Wenbo Hu, Yi-Fan Zhang, Yunlong Lin, Wenxuan Huang, Mingyang Song, et al. Unify-agent: A unified multimodal agent for world-grounded image synthesis. arXiv preprint arXiv:2603.29620, 2026

[9] Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811, 2025

[10] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024

[11] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025

[12] Yufeng Cui, Honghao Chen, Haoge Deng, Xu Huang, Xinghang Li, Jirong Liu, Yang Liu, Zhuoyan Luo, Jinsheng Wang, Wenxuan Wang, et al. Emu3.5: Native multimodal models are world learners. arXiv preprint arXiv:2510.26583, 2025

[13] DeepSeek-AI. DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence. Technical Report, 2026

[15] Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683, 2025

[16] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first international conference on machine learning, 2024

[17] Kaituo Feng, Manyuan Zhang, Shuang Chen, Yunlong Lin, Kaixuan Fan, Yilei Jiang, Hongyu Li, Dian Zheng, Chenyang Wang, and Xiangyu Yue. Gen-searcher: Reinforcing agentic search for image generation. arXiv preprint arXiv:2603.28767, 2026

[18] Zigang Geng, Yibing Wang, Yeyao Ma, Chen Li, Yongming Rao, Shuyang Gu, Zhao Zhong, Qinglin Lu, Han Hu, Xiaosong Zhang, et al. X-omni: Reinforcement learning makes discrete autoregressive image generative models great again. arXiv preprint arXiv:2507.22058, 2025

[19] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Communications of the ACM, 63(11):139–144, 2020

[20] Google. Gemini 2.0 flash. https://developers.googleblog.com/en/experiment-with-gemini-20-flash-native-image-generation, 2025

[21] Google. Gemini 3: Introducing the latest gemini ai model from google. https://blog.google/products/gemini/gemini-3/, 2025. Released November 18, 2025. Accessed: 2026-05-20

[22] Google DeepMind. Gemini image pro: High-quality image generation. https://deepmind.google/models/gemini-image/pro/, 2025. Accessed: 2026-01-26

[23] Google DeepMind. Gemini image: High-quality image generation. https://deepmind.google/models/gemini-image/flash/, 2025. Accessed: 2026-01-26

[24] Lanqing Guo, Xi Liu, Yufei Wang, Zhihao Li, and Siyu Huang. Controlling your image via simplified vector graphics. arXiv preprint arXiv:2602.14443, 2026

[25] Jun He, Junyan Ye, Zilong Huang, Dongzhi Jiang, Chenjue Zhang, Leqi Zhu, Renrui Zhang, Xiang Zhang, and Weijia Li. Mind-brush: Integrating agentic cognitive search and reasoning into image generation. 2026. URL https://arxiv.org/abs/2602.01756

[26] Dongzhi Jiang, Ziyu Guo, Renrui Zhang, Zhuofan Zong, Hao Li, Le Zhuo, Shilin Yan, Pheng-Ann Heng, and Hongsheng Li. T2i-r1: Reinforcing image generation with collaborative semantic-level and token-level cot. arXiv preprint arXiv:2505.00703, 2025

[27] Dongzhi Jiang, Renrui Zhang, Haodong Li, Zhuofan Zong, Ziyu Guo, Jun He, Claire Guo, Junyan Ye, Rongyao Fang, Weijia Li, et al. Draco: Draft as cot for text-to-image preview and rare concept generation. arXiv preprint arXiv:2512.05112, 2025

[28] Kaixun Jiang, Yuzheng Wang, Junjie Zhou, Pandeng Li, Zhihang Liu, Chen-Wei Xie, Zhaoyu Chen, Yun Zheng, and Wenqiang Zhang. Genagent: Scaling text-to-image generation via agentic multimodal reasoning, 2026. URL https://arxiv.org/abs/2601.18543

[29] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023

[30] Siqi Kou, Jiachun Jin, Zetong Zhou, Ye Ma, Yugang Wang, Quan Chen, Peng Jiang, Xiao Yang, Jun Zhu, Kai Yu, et al. Think-then-generate: Reasoning-aware text-to-image diffusion with llm encoders. arXiv preprint arXiv:2601.10332, 2026

[31] Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2024

[32] Black Forest Labs. Flux.1 kontext: Flow matching for in-context image generation and editing in latent space. 2025

[33] Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, Kyle Lacey, Yam Levi, Cheng Li, Dominik Lorenz, Jonas Müller, Dustin Podell, Robin Rombach, Harry Saini, Axel Sauer, and Luke Smith. Flux.1 kontext: Flow matching for in-context image generation and editing in latent space. arXiv, 2025

[34] Haodong Li, Chunmei Qing, Huanyu Zhang, Dongzhi Jiang, Yihang Zou, Hongbo Peng, Dingming Li, Yuhong Dai, ZePeng Lin, Juanxi Tian, et al. Coco: Code as cot for text-to-image preview and rare concept generation. arXiv preprint arXiv:2603.08652, 2026

[35] Weijia Li, Jun He, Junyan Ye, Huaping Zhong, Zhimeng Zheng, Zilong Huang, Dahua Lin, and Conghui He. Crossviewdiff: A cross-view diffusion model for satellite-to-street view synthesis. arXiv preprint arXiv:2408.14765, 2024

[36] Zongjian Li, Zheyuan Liu, Qihui Zhang, Bin Lin, Feize Wu, Shenghai Yuan, Zhiyuan Yan, Yang Ye, Wangbo Yu, Yuwei Niu, Shaodong Wang, Xinhua Cheng, and Li Yuan. Uniworld-v2: Reinforce image editing with diffusion negative-aware finetuning and mllm implicit feedback. 2025. URL https://arxiv.org/abs/2510.16888

[37] Zihan Liang, Jiahao Sun, and Haoran Ma. An llm-lvlm driven agent for iterative and fine-grained image editing. arXiv preprint arXiv:2508.17435, 2025

[38] Kevin Qinghong Lin, Yuhao Zheng, Hangyu Ran, Dantong Zhu, Dongxing Mao, Linjie Li, Philip Torr, and Alex Jinpeng Wang. Vcode: a multimodal coding benchmark with svg as symbolic visual representation. arXiv preprint arXiv:2511.02778, 2025

[39] Yunlong Lin, Linqing Wang, Kunjie Lin, Zixu Lin, Kaixiong Gong, Wenbo Li, Bin Lin, Zhenxi Li, Shiyi Zhang, Yuyang Peng, et al. Jarvisevo: Towards a self-evolving photo editing agent with synergistic editor-evaluator optimization. arXiv preprint arXiv:2511.23002, 2025

[40] OpenAI. Gpt-image-1: Models and capabilities for image generation. https://platform.openai.com/docs/models/gpt-image-1, 2024. Accessed: 2026-01-29

[41] OpenAI. Gpt-4o. https://openai.com/index/introducing-4o-image-generation, 2025

[42] OpenAI. Gpt-image-1.5: Enhanced visual reasoning and creative generation. https://platform.openai.com/docs/models/gpt-image-1.5, 2025. Accessed: 2026-01-29

[43] OpenAI. GPT-Image-2. https://developers.openai.com/api/docs/models/gpt-image-2, 2026

[44] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023

[45] Qi Qin, Le Zhuo, Yi Xin, Ruoyi Du, Zhen Li, Bin Fu, Yiting Lu, Jiakang Yuan, Xinyue Li, Dongyang Liu, et al. Lumina-image 2.0: A unified and efficient image generative framework. arXiv preprint arXiv:2503.21758, 2025

[46] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022

[47] Tianfei Ren, Zhipeng Yan, Yiming Zhao, Zhen Fang, Yu Zeng, Guohui Zhang, Hang Xu, Xiaoxiao Ma, Shiting Huang, Ke Xu, et al. Scope: Structured decomposition and conditional skill orchestration for complex image generation. arXiv preprint arXiv:2605.08043, 2026

[48] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

[49] Team Seedream, Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, Zhiyao Guo, Xiaoxia Hou, Weilin Huang, Yixuan Huang, Xiaowen Jian, Huafeng Kuang, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, Wei Liu, Yanzuo Lu, Zhengxiong Luo, Tongtong Ou, Guang Shi, Yichun Shi, Shiqi Sun, Yu Tian, Zhi Tian, Peng Wang, Rui Wang, Xun Wang, Ye Wang, Guofeng Wu, Jie Wu, Wen... Seedream 4.0: Toward Next-generation Multimodal Image Generation. arXiv, 2025

[50] Stability AI. Stable diffusion 3.5 large. https://huggingface.co/stabilityai/stable-diffusion-3.5-large, 2024

[51] Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi k2.5: Visual agentic intelligence. arXiv preprint arXiv:2602.02276, 2026

[52] Meituan LongCat Team, Bin Xiao, Chao Wang, Chengjiang Li, Chi Zhang, Chong Peng, Hang Yu, Hao Yang, Haonan Yan, Haoze Sun, et al. Longcat-next: Lexicalizing modalities as discrete tokens. arXiv preprint arXiv:2603.27538, 2026
