GenClaw: Code-Driven Agentic Image Generation
Pith reviewed 2026-06-29 07:35 UTC · model grok-4.3
The pith
GenClaw turns AI image generation into a staged process by inserting executable code sketches between reasoning and pixel synthesis.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GenClaw is a code-driven agentic image generation paradigm that empowers the agent to create like a human artist: first conceptualizing, then sketching, and finally coloring. Specifically, the agent first constructs the conceptual knowledge and context through search and reasoning. It then utilizes code (e.g., SVG, HTML, ThreeJS) to render executable visual sketches. Finally, it employs an image generation model to supplement textures, materials, and photorealism. In this workflow, code serves as a controllable intermediate canvas bridging linguistic reasoning and pixel synthesis.
What carries the argument
Code as a controllable intermediate canvas that bridges the agent's linguistic reasoning and downstream pixel synthesis in a three-stage workflow.
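The three stages can be sketched as plain functions to show where the code canvas sits between reasoning and synthesis. This is a minimal illustration, not the paper's implementation: `Concept`, `conceptualize`, `sketch`, `colorize`, `stub_image_model`, and the `ImageModel` signature are all hypothetical names, and the image model is replaced by a stub that returns a text tag rather than pixels.

```python
from dataclasses import dataclass
from typing import Callable
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

# Hypothetical signature: (svg_sketch, text_prompt) -> image bytes.
ImageModel = Callable[[str, str], bytes]


@dataclass
class Concept:
    """Stage 1 output: a structured description of what to draw."""
    subject: str
    background: str = "#dfefff"
    size: int = 256


def conceptualize(prompt: str) -> Concept:
    """Stand-in for search and reasoning; a real agent would enrich the prompt."""
    return Concept(subject=prompt.strip())


def sketch(concept: Concept) -> str:
    """Stage 2: emit an SVG sketch. The layout is fixed here for illustration."""
    half = concept.size // 2
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{concept.size}" height="{concept.size}">'
        f"<title>{escape(concept.subject)}</title>"
        f'<rect width="100%" height="100%" fill="{concept.background}"/>'
        f'<circle id="subject" cx="{half}" cy="{half}" r="40" fill="#cc3333"/>'
        "</svg>"
    )


def colorize(svg: str, prompt: str, image_model: ImageModel) -> bytes:
    """Stage 3: hand the sketch and the prompt to an image model."""
    ET.fromstring(svg)  # raises ParseError if the sketch is not well-formed XML
    return image_model(svg, prompt)


def stub_image_model(svg: str, prompt: str) -> bytes:
    """Placeholder (assumption): returns a text tag instead of pixels."""
    return f"image for {prompt!r} from {len(svg)}-char sketch".encode()


def generate(prompt: str, image_model: ImageModel = stub_image_model) -> bytes:
    """Run the three stages in order."""
    concept = conceptualize(prompt)
    return colorize(sketch(concept), concept.subject, image_model)
```

Keeping the sketch as a string between stages 2 and 3 is what makes it inspectable and editable before any pixels are produced.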
If this is right
- Agents gain direct editability over visual structure by modifying the code sketch rather than relying solely on repeated prompt adjustments.
- The generation process gains interpretability because the code layer exposes the agent's reasoning in a readable and revisable form.
- Programmatic control over layout and elements can be combined with the strengths of generative models for photorealistic output.
- The workflow reduces dependence on black-box refinement loops by providing an explicit intermediate representation.
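The editability point can be made concrete with a small sketch: instead of rewriting a prompt, the agent changes a coordinate in the code canvas. The helper below is hypothetical and assumes the sketch is SVG whose elements carry `id` attributes and are positioned with `cx`/`cy` (as circles and ellipses are); it uses only Python's standard `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
# Serialize SVG elements without an "ns0:" prefix.
ET.register_namespace("", SVG_NS)


def move_element(svg: str, element_id: str, dx: float, dy: float) -> str:
    """Shift the element with the given id by (dx, dy) and return the new SVG.

    Only cx/cy-positioned shapes are handled in this sketch.
    """
    root = ET.fromstring(svg)
    for el in root.iter():
        if el.get("id") == element_id:
            for attr, delta in (("cx", dx), ("cy", dy)):
                el.set(attr, str(float(el.get(attr, "0")) + delta))
            return ET.tostring(root, encoding="unicode")
    raise KeyError(f"no element with id {element_id!r}")


original = (
    '<svg xmlns="http://www.w3.org/2000/svg">'
    '<circle id="sun" cx="10" cy="20" r="5"/>'
    "</svg>"
)
edited = move_element(original, "sun", dx=30, dy=-5)
```

The edit is local and deterministic: only the named element moves, which is the kind of control a prompt-only refinement loop cannot guarantee.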
Where Pith is reading between the lines
- If code-generation reliability increases, the same staged approach could apply to domains such as 3D scene creation or animation where intermediate representations already exist.
- The separation of structure via code from appearance via image models could reduce the need to retrain large image generators for better structural fidelity.
- Inspecting and editing the code canvas might offer a practical debugging path for generation failures that current prompt-only systems lack.
Load-bearing premise
Large language models can generate accurate, executable code that correctly captures the agent's intended visual concept and that this code combines cleanly with image models.
What would settle it
An experiment in which the code produced by the agent repeatedly fails to match the planned concept or introduces visual errors that the final image model cannot resolve.
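One way such an experiment could be scored is sketched below, under the assumption that each planned concept is reduced to a set of element ids the sketch must contain. The function names and the three outcome labels are illustrative, not taken from the paper. The check only detects malformed XML and missing elements, so it gives a lower bound on failures: a sketch can contain every planned id and still place or shape things wrongly.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def classify_sketch(svg: str, planned_ids: set[str]) -> str:
    """Label one generated sketch as ok, syntax_error, or missing_elements."""
    try:
        root = ET.fromstring(svg)
    except ET.ParseError:
        return "syntax_error"
    found = {el.get("id") for el in root.iter() if el.get("id")}
    if not planned_ids <= found:
        return "missing_elements"
    return "ok"


def failure_rate(samples: list[tuple[str, set[str]]]) -> tuple[float, Counter]:
    """Return the fraction of non-ok sketches and the per-label counts."""
    counts = Counter(classify_sketch(svg, ids) for svg, ids in samples)
    total = sum(counts.values())
    failed = total - counts["ok"]
    return (failed / total if total else 0.0), counts


samples = [
    ('<svg><circle id="sun"/></svg>', {"sun"}),        # matches the plan
    ('<svg><circle id="sun"/>', {"sun"}),              # unclosed root element
    ('<svg><rect id="sky"/></svg>', {"sun", "sky"}),   # planned "sun" is absent
]
rate, counts = failure_rate(samples)
```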
Original abstract
Image generation models have evolved from text-conditioned pixel synthesis toward multimodal agents endowed with visual comprehension and tool invocation capabilities. Yet, existing agents remain at the mercy of underlying black-box image models. Their workflow is trapped in a repetitive cycle of prompt rewriting for generation refinement, leaving them with no mechanism to directly manipulate the canvas. In essence, the potential of LLMs to serve as a genuine "brush" for precise visual construction remains largely untapped. In this paper, we propose GenClaw, a code-driven agentic image generation paradigm that empowers the agent to create like a human artist: first conceptualizing, then sketching, and finally coloring. Specifically, the agent first constructs the conceptual knowledge and context through search and reasoning. It then utilizes code (e.g., SVG, HTML, ThreeJS) to render executable visual sketches. Finally, it employs an image generation model to supplement textures, materials, and photorealism. In this workflow, code serves as a controllable intermediate canvas bridging linguistic reasoning and pixel synthesis, seamlessly integrating programmatic logic with the visual expressiveness of generative models. By transforming image generation from a black-box paradigm into a staged process akin to authentic human creation, GenClaw offers a step toward for highly controllable and interpretable visual generation systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes GenClaw, a code-driven agentic image generation paradigm in which an LLM agent first builds conceptual knowledge via search and reasoning, then renders an executable visual sketch using code (SVG, HTML, or ThreeJS), and finally applies an image generation model to add textures, materials, and photorealism. Code is positioned as a controllable intermediate canvas that bridges linguistic reasoning and pixel synthesis, transforming black-box generation into a staged, human-like process for greater controllability and interpretability.
Significance. If the workflow can be shown to work reliably, the staged code-mediated approach could meaningfully improve controllability and interpretability in agentic image generation by allowing direct programmatic manipulation of visual structure before photorealistic refinement. This would represent a conceptual advance over purely prompt-based black-box agents.
major comments (3)
- Abstract: The central claim that GenClaw 'empowers the agent to create like a human artist' and offers 'a step toward highly controllable and interpretable visual generation systems' is unsupported because the manuscript contains no experiments, ablation studies, user evaluations, quantitative metrics, or even implementation details demonstrating that the proposed workflow achieves these benefits.
- Abstract: The proposal rests on the untested assumption that current LLMs can reliably emit correct, intent-preserving executable code (SVG/HTML/ThreeJS) that accurately captures conceptual reasoning; no failure-mode analysis, error rates, or comparison against black-box baselines is provided to substantiate this.
- Abstract: No evidence or discussion is given on whether the code-to-image handoff preserves control or introduces new artifacts, which is load-bearing for the claim that code serves as a 'seamlessly integrating' controllable intermediate representation.
minor comments (1)
- Abstract: The sentence 'offers a step toward for highly controllable' contains a grammatical error and should be corrected.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. The work presents a conceptual proposal for a code-mediated workflow rather than an empirical study. We will revise the abstract and add a dedicated limitations and future work section to ensure claims are appropriately scoped and to outline evaluation directions.
point-by-point responses
- Referee: Abstract: The central claim that GenClaw 'empowers the agent to create like a human artist' and offers 'a step toward highly controllable and interpretable visual generation systems' is unsupported because the manuscript contains no experiments, ablation studies, user evaluations, quantitative metrics, or even implementation details demonstrating that the proposed workflow achieves these benefits.
  Authors: We agree that the current abstract language overstates demonstrated outcomes. The manuscript introduces the staged workflow as a conceptual paradigm; the quoted phrases describe intended properties of the approach. In revision we will replace these with more measured wording (e.g., 'aims to empower' and 'potentially offers a step toward') and add an explicit statement that empirical validation remains future work.
  Revision: yes
- Referee: Abstract: The proposal rests on the untested assumption that current LLMs can reliably emit correct, intent-preserving executable code (SVG/HTML/ThreeJS) that accurately captures conceptual reasoning; no failure-mode analysis, error rates, or comparison against black-box baselines is provided to substantiate this.
  Authors: The manuscript does rely on the premise that LLMs can produce usable code sketches, drawing from observed capabilities in related literature, but provides no dedicated analysis of failure modes. We will add a new subsection under Limitations that enumerates known risks (syntax errors, semantic drift, style mismatch) and sketches how future controlled studies could quantify them against direct image-generation baselines.
  Revision: yes
- Referee: Abstract: No evidence or discussion is given on whether the code-to-image handoff preserves control or introduces new artifacts, which is load-bearing for the claim that code serves as a 'seamlessly integrating' controllable intermediate representation.
  Authors: We concur that the handoff step is central and currently undiscussed. Revision will include a short analysis of the transition, noting that the image model receives both the rendered sketch and a textual prompt derived from the same reasoning trace, and will flag potential artifacts (e.g., loss of precise geometry, texture hallucination). We will also qualify the term 'seamlessly integrating' to 'intended to integrate'.
  Revision: yes
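The handoff described in the last response, where the image model receives a sketch and a prompt derived from the same reasoning trace, could be kept consistent by generating both from one shared plan. The sketch below assumes a hypothetical `PlannedObject` record and circle-only geometry. It illustrates the idea rather than the authors' design, and it says nothing about whether the image model then respects the geometry.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET


@dataclass(frozen=True)
class PlannedObject:
    """One item from the reasoning trace (hypothetical schema)."""
    name: str    # also used as the SVG element id
    color: str   # an SVG named color, e.g. "red"
    x: int
    y: int
    radius: int


def plan_to_svg(plan: list[PlannedObject], size: int = 256) -> str:
    """Render every planned object as a circle in one SVG sketch."""
    root = ET.Element(
        "svg",
        {"xmlns": "http://www.w3.org/2000/svg",
         "width": str(size), "height": str(size)},
    )
    for obj in plan:
        ET.SubElement(
            root,
            "circle",
            {"id": obj.name, "cx": str(obj.x), "cy": str(obj.y),
             "r": str(obj.radius), "fill": obj.color},
        )
    return ET.tostring(root, encoding="unicode")


def plan_to_prompt(plan: list[PlannedObject]) -> str:
    """Describe the same objects in text for the image model."""
    items = ", ".join(f"{obj.color} {obj.name}" for obj in plan)
    return f"A photorealistic scene containing: {items}. Follow the sketch layout."


def build_handoff(plan: list[PlannedObject]) -> tuple[str, str]:
    """Return (svg_sketch, text_prompt), both derived from the same plan."""
    return plan_to_svg(plan), plan_to_prompt(plan)


plan = [
    PlannedObject("ball", "red", 64, 180, 30),
    PlannedObject("sun", "gold", 200, 50, 25),
]
svg, prompt = build_handoff(plan)
```

Because both outputs come from the same list, any object named in the prompt is guaranteed to have a matching element in the sketch, which removes one source of disagreement at the handoff.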
Circularity Check
No circularity: conceptual paradigm proposal without derivations or self-referential reductions
full rationale
The manuscript is a forward-looking proposal for a code-driven agentic workflow (conceptualization → code sketch → image refinement) with no equations, fitted parameters, predictions, or load-bearing self-citations. No step reduces by construction to its own inputs, as there are no quantitative claims, uniqueness theorems, or ansatzes to inspect. The central claim remains a descriptive suggestion rather than a derived result.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLMs can reliably generate correct executable code (SVG, HTML, ThreeJS) that captures conceptual knowledge for visual sketches