Gen-Searcher: Reinforcing Agentic Search for Image Generation
Pith reviewed 2026-05-25 06:28 UTC · model grok-4.3
The pith
An image generation agent trained to perform multi-hop search for knowledge and reference images achieves substantial performance gains on knowledge-intensive tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Gen-Searcher is presented as the first search-augmented image generation agent that performs multi-hop reasoning and search to collect textual knowledge and reference images needed for grounded generation. Two datasets are curated, Gen-Searcher-SFT-10k and Gen-Searcher-RL-6k, and the model is trained with SFT followed by agentic reinforcement learning using dual reward feedback in GRPO training. This yields substantial gains, improving Qwen-Image by around 16 points on KnowGen and 15 points on WISE.
What carries the argument
The dual-reward GRPO training combining text-based and image-based rewards to provide stable and informative learning signals for the agent performing multi-hop search.
If this is right
- Image generators can handle knowledge-intensive or time-sensitive prompts by dynamically searching for information.
- The approach provides a foundation for open development of search agents in visual generation tasks.
- Dual reward feedback enables more effective reinforcement learning for agentic behaviors in generation.
- New benchmarks like KnowGen allow systematic evaluation of search-grounded image generation capabilities.
Where Pith is reading between the lines
- Integrating live web search could allow the agent to generate images based on the most current events.
- The method may generalize to other modalities such as video generation requiring external references.
- Overfitting risks could be mitigated by expanding the diversity of training prompts beyond the curated sets.
Load-bearing premise
The specific datasets and dual-reward GRPO training lead to genuine multi-hop search behavior and grounded generation rather than overfitting to the training prompts and images.
What would settle it
Evaluating the trained agent on a held-out set of prompts that require information absent from the training datasets and checking if it performs accurate searches and generates correct images.
Figures
read the original abstract
Recent image generation models have shown strong capabilities in generating high-fidelity and photorealistic images. However, they are fundamentally constrained by frozen internal knowledge, thus often failing on real-world scenarios that are knowledge-intensive or require up-to-date information. In this paper, we present Gen-Searcher, as the first attempt to train a search-augmented image generation agent, which performs multi-hop reasoning and search to collect the textual knowledge and reference images needed for grounded generation. To achieve this, we construct a tailored data pipeline and curate two high-quality datasets, Gen-Searcher-SFT-10k and Gen-Searcher-RL-6k, containing diverse search-intensive prompts and corresponding ground-truth synthesis images. We further introduce KnowGen, a comprehensive benchmark that explicitly requires search-grounded external knowledge for image generation and evaluates models from multiple dimensions. Based on these resources, we train Gen-Searcher with SFT followed by agentic reinforcement learning with dual reward feedback, which combines text-based and image-based rewards to provide more stable and informative learning signals for GRPO training. Experiments show that Gen-Searcher brings substantial gains, improving Qwen-Image by around 16 points on KnowGen and 15 points on WISE. We hope this work can serve as an open foundation for search agents in image generation, and we fully open-source our data, models, and code.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Gen-Searcher as the first search-augmented image generation agent that performs multi-hop reasoning and external search to collect textual knowledge and reference images for grounded generation. The authors construct a tailored data pipeline to curate Gen-Searcher-SFT-10k and Gen-Searcher-RL-6k datasets containing search-intensive prompts and ground-truth synthesis images, introduce the KnowGen benchmark for evaluating search-grounded image generation across multiple dimensions, and train the model with SFT followed by agentic reinforcement learning using dual-reward GRPO that combines text-based and image-based signals. Experiments report substantial gains, improving Qwen-Image by around 16 points on KnowGen and 15 points on WISE.
Significance. If the reported gains reflect genuine acquisition of multi-hop search behavior that generalizes beyond the training distribution, the work would be significant as the first open foundation for agentic search in image generation. The open-sourcing of the datasets, models, and code is a concrete strength that enables reproducibility and follow-on research.
major comments (2)
- [Abstract / Data Construction] Abstract and data pipeline description: both the training sets (Gen-Searcher-SFT-10k, RL-6k) and the KnowGen benchmark are generated by the same 'tailored data pipeline' that produces 'search-intensive prompts and corresponding ground-truth synthesis images,' yet no quantitative checks (embedding similarity, n-gram overlap, or deduplication) are reported. This directly undermines the central claim of 16-point gains on KnowGen, because the dual-reward GRPO objective could improve scores by fitting the specific synthesis style and prompt patterns rather than learning robust external search.
- [Experiments] Experiments section: the abstract states clear numerical gains but supplies no information on baseline controls, statistical significance, data leakage verification, or ablation of the dual-reward component and the GRPO stage. Without these, the attribution of the 16- and 15-point improvements specifically to the agentic search pipeline remains only partially supported.
minor comments (1)
- [Abstract] The abstract refers to 'around 16 points' and '15 points' on KnowGen and WISE but does not name the underlying evaluation metrics or scoring protocol.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive feedback. The concerns about potential data overlap and the need for stronger experimental controls are valid and help improve the clarity of our claims. Below we respond point-by-point to the major comments.
read point-by-point responses
-
Referee: [Abstract / Data Construction] Abstract and data pipeline description: both the training sets (Gen-Searcher-SFT-10k, RL-6k) and the KnowGen benchmark are generated by the same 'tailored data pipeline' that produces 'search-intensive prompts and corresponding ground-truth synthesis images,' yet no quantitative checks (embedding similarity, n-gram overlap, or deduplication) are reported. This directly undermines the central claim of 16-point gains on KnowGen, because the dual-reward GRPO objective could improve scores by fitting the specific synthesis style and prompt patterns rather than learning robust external search.
Authors: We agree that the absence of quantitative overlap checks is a limitation in the current manuscript. The tailored pipeline was designed to produce diverse, search-intensive prompts drawn from varied knowledge domains, with manual curation steps intended to promote variety. Nevertheless, without reported metrics it is difficult to fully rule out style or pattern fitting. In the revision we will add embedding similarity (e.g., cosine similarity on sentence embeddings), n-gram overlap statistics, and explicit deduplication results between the SFT/RL training sets and KnowGen. These analyses will be placed in a new subsection of the data construction section. revision: yes
-
Referee: [Experiments] Experiments section: the abstract states clear numerical gains but supplies no information on baseline controls, statistical significance, data leakage verification, or ablation of the dual-reward component and the GRPO stage. Without these, the attribution of the 16- and 15-point improvements specifically to the agentic search pipeline remains only partially supported.
Authors: The manuscript reports gains relative to Qwen-Image on KnowGen and WISE, but we acknowledge that details on statistical significance, explicit leakage verification, and component ablations are missing. In the revised version we will (1) add statistical significance testing (e.g., paired t-tests or bootstrap confidence intervals) for the reported improvements, (2) incorporate the overlap and leakage checks described above, and (3) include ablation studies that isolate the contribution of the dual-reward signals versus the GRPO stage alone. These additions will be presented in an expanded Experiments section with a dedicated ablation table. revision: yes
Circularity Check
No circularity; empirical pipeline with external benchmarks.
full rationale
The paper describes constructing a data pipeline to curate Gen-Searcher-SFT-10k and Gen-Searcher-RL-6k, training via SFT then dual-reward GRPO, and reporting empirical gains on the separately introduced KnowGen benchmark plus WISE. No equations, uniqueness theorems, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. All load-bearing claims are performance deltas on held-out benchmarks, which remain falsifiable outside the training distribution. Potential distributional overlap between pipeline-generated sets is a data-validity issue, not a reduction of any derivation to its own inputs by construction.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 9 Pith papers
-
GenEvolve: Self-Evolving Image Generation Agents via Tool-Orchestrated Visual Experience Distillation
GenEvolve proposes a self-evolving agent framework for open-ended image generation that uses tool-orchestrated trajectories and visual experience distillation from best-worst differences to achieve reported state-of-t...
-
Aurora: Unified Video Editing with a Tool-Using Agent
Aurora introduces a VLM-based agent that converts raw user video edit requests into structured conditioning inputs for a unified diffusion transformer, improving performance on underspecified tasks via a new benchmark.
-
From Web to Pixels: Bringing Agentic Search into Visual Perception
WebEye benchmark and Pixel-Searcher agent enable visual perception tasks by using web search to resolve object identities before precise localization or answering.
-
Flow-OPD: On-Policy Distillation for Flow Matching Models
Flow-OPD applies on-policy distillation to flow matching models via specialized teachers, cold-start initialization, and manifold anchor regularization, lifting GenEval from 63 to 92 and OCR from 59 to 94 on Stable Di...
-
Flow-OPD: On-Policy Distillation for Flow Matching Models
Flow-OPD applies on-policy distillation to Flow Matching models through specialized teachers, cold-start initialization, task routing, and manifold regularization, lifting GenEval from 63 to 92 and OCR from 59 to 94 o...
-
Flow-OPD: On-Policy Distillation for Flow Matching Models
Flow-OPD applies on-policy distillation to flow matching models, achieving GenEval of 92 and OCR accuracy of 94 on Stable Diffusion 3.5 Medium while avoiding the seesaw effect of multi-reward optimization.
-
Flow-OPD: On-Policy Distillation for Flow Matching Models
Flow-OPD applies on-policy distillation to flow-matching text-to-image models, lifting GenEval from 63 to 92 and OCR accuracy from 59 to 94 while preserving fidelity.
-
SCOPE: Structured Decomposition and Conditional Skill Orchestration for Complex Image Generation
SCOPE maintains semantic commitments via structured specifications and conditional skill orchestration, achieving 0.60 EGIP on the new Gen-Arena benchmark while outperforming baselines on WISE-V and MindBench.
-
GenEvolve: Self-Evolving Image Generation Agents via Tool-Orchestrated Visual Experience Distillation
GenEvolve introduces a self-evolving agent framework for image generation using tool-orchestrated trajectories and Visual Experience Distillation to achieve claimed SOTA results on benchmarks.
Reference graph
Works this paper leans on
-
[1]
Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report.arXiv preprint arXiv:2508.02324, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[2]
Gemini image pro: High-quality image generation
Google DeepMind. Gemini image pro: High-quality image generation. https://deepmind.google/models/ gemini-image/pro/, 2025
work page 2025
-
[3]
Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer
Huanqia Cai, Sihan Cao, Ruoyi Du, Peng Gao, Steven Hoi, Zhaohui Hou, Shijie Huang, Dengyang Jiang, Xin Jin, Liangchen Li, et al. Z-image: An efficient image generation foundation model with single-stream diffusion transformer.arXiv preprint arXiv:2511.22699, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[4]
Re-imagen: Retrieval-augmented text-to-image generator
Wenhu Chen, Hexiang Hu, Chitwan Saharia, and William W Cohen. Re-imagen: Retrieval-augmented text-to-image generator. arXiv preprint arXiv:2209.14491, 2022
-
[5]
Retrieval-augmented diffusion models
Andreas Blattmann, Robin Rombach, Kaan Oktay, Jonas Müller, and Björn Ommer. Retrieval-augmented diffusion models. Advances in Neural Information Processing Systems, 35:15309–15324, 2022
work page 2022
-
[6]
Zhiyou Xiao, Qinhan Yu, Binghui Li, Geng Chen, Chong Chen, and Wentao Zhang. M2io-r1: An efficient rl-enhanced reasoning framework for multimodal retrieval augmented multimodal generation.arXiv preprint arXiv:2508.06328, 2025
-
[7]
Ia-t2i: Internet-augmented text-to-image generation.arXiv preprint arXiv:2505.15779, 2025
Chuanhao Li, Jianwen Sun, Yukang Feng, Mingliang Zhai, Yifan Chang, and Kaipeng Zhang. Ia-t2i: Internet-augmented text-to-image generation.arXiv preprint arXiv:2505.15779, 2025
-
[8]
Jun He, Junyan Ye, Zilong Huang, Dongzhi Jiang, Chenjue Zhang, Leqi Zhu, Renrui Zhang, Xiang Zhang, and Weijia Li. Mind-brush: Integrating agentic cognitive search and reasoning into image generation.arXiv preprint arXiv:2602.01756, 2026
-
[9]
WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent
Xinyu Geng, Peng Xia, Zhen Zhang, Xinyu Wang, Qiuchen Wang, Ruixue Ding, Chenxi Wang, Jialong Wu, Yida Zhao, Kuan Li, et al. Webwatcher: Breaking new frontier of vision-language deep research agent.arXiv preprint arXiv:2508.05748, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[10]
Exploring Reasoning Reward Model for Agents
Kaixuan Fan, Kaituo Feng, Manyuan Zhang, Tianshuo Peng, Zhixun Li, Yilei Jiang, Shuang Chen, Peng Pei, Xunliang Cai, and Xiangyu Yue. Exploring reasoning reward model for agents.arXiv preprint arXiv:2601.22154, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[11]
Gemini 3 pro.https://deepmind.google/models/gemini/pro/, 2025
Google DeepMind. Gemini 3 pro.https://deepmind.google/models/gemini/pro/, 2025
work page 2025
-
[12]
Seed1.8 model card: Towards generalized real-world agency
Bytedance Seed. Seed1.8 model card: Towards generalized real-world agency. https://seed.bytedance.com/en/ seed1_8, 2025. 13 Gen-Searcher: Reinforcing Agentic Search for Image Generation
work page 2025
-
[13]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[14]
OneThinker: All-in-one Reasoning Model for Image and Video
Kaituo Feng, Manyuan Zhang, Hongyu Li, Kaixuan Fan, Shuang Chen, Yilei Jiang, Dian Zheng, Peiwen Sun, Yiyuan Zhang, Haoze Sun, et al. Onethinker: All-in-one reasoning model for image and video.arXiv preprint arXiv:2512.03043, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[15]
Seedream 4.5.https://seed.bytedance.com/en/seedream4_5, 2025
Bytedance Seed. Seedream 4.5.https://seed.bytedance.com/en/seedream4_5, 2025
work page 2025
-
[16]
WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation
Yuwei Niu, Munan Ning, Mengren Zheng, Weiyang Jin, Bin Lin, Peng Jin, Jiaqi Liao, Chaoran Feng, Kunpeng Ning, Bin Zhu, et al. Wise: A world knowledge-informed semantic evaluation for text-to-image generation.arXiv preprint arXiv:2503.07265, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[17]
Hongyu Li, Manyuan Zhang, Dian Zheng, Ziyu Guo, Yimeng Jia, Kaituo Feng, Hao Yu, Yexin Liu, Yan Feng, Peng Pei, et al. Editthinker: Unlocking iterative reasoning for any image editor.arXiv preprint arXiv:2512.05965, 2025
-
[18]
AIA: Rethinking Architecture Decoupling Strategy In Unified Multimodal Model
Dian Zheng, Manyuan Zhang, Hongyu Li, Kai Zou, Hongbo Liu, Ziyu Guo, Kaituo Feng, Yexin Liu, Ying Luo, Yan Feng, et al. Architecture decoupling is not all you need for unified multimodal model.arXiv preprint arXiv:2511.22663, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[19]
Hang Chen, Qian Xiang, Jiaxin Hu, Meilin Ye, Chao Yu, Hao Cheng, and Lei Zhang. Comprehensive exploration of diffusion models in image generation: a survey.Artificial Intelligence Review, 58(4):99, 2025
work page 2025
-
[20]
Stability AI. Stable diffusion 3.5 large. https://huggingface.co/stabilityai/stable-diffusion-3. 5-large, 2024
work page 2024
-
[21]
Imagen.https://deepmind.google/models/imagen/, 2025
Google DeepMind. Imagen.https://deepmind.google/models/imagen/, 2025
work page 2025
-
[22]
Flux 1.https://github.com/black-forest-labs/flux, 2024
black-forest labs. Flux 1.https://github.com/black-forest-labs/flux, 2024
work page 2024
-
[23]
LongCat-Image Technical Report
Meituan LongCat Team, Hanghang Ma, Haoxian Tan, Jiale Huang, Junqiang Wu, Jun-Yan He, Lishuai Gao, Songlin Xiao, Xiaoming Wei, Xiaoqi Ma, et al. Longcat-image technical report.arXiv preprint arXiv:2512.07584, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[24]
Agentic Reinforced Policy Optimization
Guanting Dong, Hangyu Mao, Kai Ma, Licheng Bao, Yifei Chen, Zhongyuan Wang, Zhongxia Chen, Jiazhen Du, Huiyang Wang, Fuzheng Zhang, et al. Agentic reinforced policy optimization.arXiv preprint arXiv:2507.19849, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[25]
AdaTooler-V: Adaptive Tool-Use for Images and Videos
Chaoyang Wang, Kaituo Feng, Dongyang Chen, Zhongyu Wang, Zhixun Li, Sicheng Gao, Meng Meng, Xu Zhou, Manyuan Zhang, Yuzhang Shang, et al. Adatooler-v: Adaptive tool-use for images and videos.arXiv preprint arXiv:2512.16918, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[26]
Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing
Junfei Wu, Jian Guan, Kaituo Feng, Qiang Liu, Shu Wu, Liang Wang, Wei Wu, and Tieniu Tan. Reinforcing spatial reasoning in vision-language models with interwoven thinking and visual drawing.arXiv preprint arXiv:2506.09965, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[27]
Xiaoying Zhang, Yipeng Zhang, Hao Sun, Kaituo Feng, Chaochao Lu, Chao Yang, and Helen Meng. Critique-grpo: Advancing llm reasoning with natural language and numerical feedback.arXiv preprint arXiv:2506.03106, 2025
-
[28]
Video-R1: Reinforcing Video Reasoning in MLLMs
Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in mllms.arXiv preprint arXiv:2503.21776, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[29]
Shuang Chen, Yue Guo, Zhaochen Su, Yafu Li, Yulun Wu, Jiacheng Chen, Jiayu Chen, Weijie Wang, Xiaoye Qu, and Yu Cheng. Advancing multimodal reasoning: From optimized cold start to staged reinforcement learning.arXiv preprint arXiv:2506.04207, 2025
-
[30]
Sophiavl-r1: Reinforcing mllms reasoning with thinking reward.arXiv preprint arXiv:2505.17018, 2025
Kaixuan Fan, Kaituo Feng, Haoming Lyu, Dongzhan Zhou, and Xiangyu Yue. Sophiavl-r1: Reinforcing mllms reasoning with thinking reward.arXiv preprint arXiv:2505.17018, 2025
-
[31]
Shuang Chen, Yue Guo, Yimeng Ye, Shijue Huang, Wenbo Hu, Haoxi Li, Manyuan Zhang, Jiayu Chen, Song Guo, and Nanyun Peng. Ares: Multimodal adaptive reasoning via difficulty-aware token-level entropy shaping.arXiv preprint arXiv:2510.08457, 2025
-
[32]
Group-in-Group Policy Optimization for LLM Agent Training
Lang Feng, Zhenghai Xue, Tingcong Liu, and Bo An. Group-in-group policy optimization for llm agent training.arXiv preprint arXiv:2505.10978, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[33]
Wenxuan Huang, Yu Zeng, Qiuchen Wang, Zhen Fang, Shaosheng Cao, Zheng Chu, Qingyu Yin, Shuang Chen, Zhenfei Yin, Lin Chen, et al. Vision-deepresearch: Incentivizing deepresearch capability in multimodal large language models.arXiv preprint arXiv:2601.22060, 2026
-
[34]
Shuang Sun, Huatong Song, Yuhao Wang, Ruiyang Ren, Jinhao Jiang, Junjie Zhang, Fei Bai, Jia Deng, Wayne Xin Zhao, Zheng Liu, et al. Simpledeepsearcher: Deep information seeking via web-powered reasoning trajectory synthesis.arXiv preprint arXiv:2505.16834, 2025
-
[35]
Introducing gpt-4.1 in the api.https://openai.com/index/gpt-4-1/, 2025
OpenAI. Introducing gpt-4.1 in the api.https://openai.com/index/gpt-4-1/, 2025
work page 2025
-
[36]
Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[37]
Gpt-image-1: Models and capabilities for image generation
OpenAI. Gpt-image-1: Models and capabilities for image generation. https://platform.openai.com/docs/ models/gpt-image-1, 2024
work page 2024
-
[38]
OpenAI. Gpt-image-1.5: Enhanced visual reasoning and creative generation.https://platform.openai.com/docs/ models/gpt-image-1.5, 2025. 14 Gen-Searcher: Reinforcing Agentic Search for Image Generation
work page 2025
-
[39]
Gemini image: High-quality image generation
Google DeepMind. Gemini image: High-quality image generation. https://deepmind.google/models/ gemini-image/flash/, 2025
work page 2025
-
[40]
Seedream 4.0: Toward Next-generation Multimodal Image Generation
Team Seedream, Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, Zhiyao Guo, Xiaoxia Hou, Weilin Huang, Yixuan Huang, et al. Seedream 4.0: Toward next-generation multimodal image generation.arXiv preprint arXiv:2509.20427, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[41]
Stability AI. Stable diffusion 3.5 medium. https://huggingface.co/stabilityai/stable-diffusion-3. 5-medium, 2024
work page 2024
-
[42]
Lumina- image 2.0: A unified and efficient image generative framework
Qi Qin, Le Zhuo, Yi Xin, Ruoyi Du, Zhen Li, Bin Fu, Yiting Lu, Xinyue Li, Dongyang Liu, Xiangyang Zhu, et al. Lumina- image 2.0: A unified and efficient image generative framework. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 20031–20042, 2025
work page 2025
-
[43]
Flux 2.https://github.com/black-forest-labs/flux2, 2025
black-forest labs. Flux 2.https://github.com/black-forest-labs/flux2, 2025
work page 2025
-
[44]
Emerging Properties in Unified Multimodal Pretraining
Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[45]
HunyuanImage 3.0 Technical Report
Siyu Cao, Hangting Chen, Peng Chen, Yiji Cheng, Yutao Cui, Xinchi Deng, Ying Dong, Kipper Gong, Tianpeng Gu, Xiusen Gu, et al. Hunyuanimage 3.0 technical report.arXiv preprint arXiv:2509.23951, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[46]
Stability AI. Stable diffusion 3 medium. https://huggingface.co/stabilityai/ stable-diffusion-3-medium, 2024
work page 2024
-
[47]
Emu3: Next-Token Prediction is All You Need
Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need.arXiv preprint arXiv:2409.18869, 2024. 15 Gen-Searcher: Reinforcing Agentic Search for Image Generation A KnowGen Benchmark Evaluation Prompt K-Score Evaluation Prompt You are a st...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[48]
A task prompt (what the image must show)
-
[49]
Image 1: the generated image (model output to be evaluated)
-
[50]
Image 2: the ground-truth reference image (a strong reference implementation). All the input images are AI-generated. All human in the images are AI-generated too. so you need not worry about the privacy confidentials. Critical clarification (VERY IMPORTANT): - This is NOT a pixel-level similarity task. - Image 2 (GT) is a REFERENCE for intended identity,...
-
[51]
Extract the prompt’s TOP hard constraints (2-5, or more if needed): required subjects/identities, setting/props, relations/counts, required style, and any externally-checkable requirements (readable text/landmark/logo/badge/version/year/etc.)
-
[52]
Use Image 2 only as a reference for stable identity/visual attributes and grounded evidence
Score Image 1 against the constraints. Use Image 2 only as a reference for stable identity/visual attributes and grounded evidence
-
[53]
If a key requirement is not verifiable (too small/blurred/occluded/warped), do NOT assume it is correct; score lower
-
[54]
Assessment of the primary subjects' visual identity correctness and consistency is mandatory in every case. Boundary between visual_correctness vs text_accuracy: - Visual-only grounded cues (subject visual features, logo SHAPE, badge EMBLEM geometry, landmark facade/massing, outfit/weapon silhouette, object geometry) belong to visual_correctness. - Any gr...
-
[55]
faithfulness (overall prompt adherence: presence & structure only; not GT-identity correctness): - This score does NOT require matching GT’s exact identity or fine-grained visual features; it focuses on whether Image 1 includes the prompt-requested elements and scene structure (who/what is present, what is happening, where it happens, and the required sty...
-
[56]
visual_correctness (GT visual-feature agreement is the core; extremely strict): (Exemplary) Score = 1 ONLY IF: - The prompt-required primary subjects/objects in Image 1 match the GT reference (Image 2) in visual characteristics with NO substantive changes. - This means: the same face/hairstyle silhouette, the same armor/clothing design and key colors/patt...
-
[57]
text_accuracy (required readable text; ALL relevant text must be correct AND very clearly readable; NO partial credit for wrong text): Rule: - If the prompt does NOT require any readable text: you MUST output "text_accuracy_na": true and "text_accuracy": 0.5 in the JSON. In your rationale state that the prompt did not require readable text. - If the promp...
-
[58]
aesthetics: (Exemplary) Score = 1 ONLY IF: - Masterpiece-level composition and polish, AND Image 1 is NOT worse than GT in overall aesthetic quality. (Conditional) Score = 0.5 ONLY IF: - Very beautiful and polished, but slightly worse than GT (ONLY slightly) OR slightly less refined than top-tier. (Rejected) Score = 0 IF: - Anything clearly worse than GT ...
-
[59]
Task prompt: the original user requirement (what image we want to generate)
-
[60]
Ground-truth reference image: the target image we want the pipeline to produce
-
[61]
Model's answer: the model's output in <answer>, containing: - gen_prompt: a natural-language prompt for an image generator (composition, style, subjects, etc.). - reference_images: a list of chosen reference images (each with img_id, title, note, etc.) that the model selected from search to guide generation. Your task (TEXT + VISUAL): - From both TEXT and...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.