pith · machine review for the scientific record

arxiv: 2603.28767 · v2 · submitted 2026-03-30 · 💻 cs.CV

Recognition: unknown

Gen-Searcher: Reinforcing Agentic Search for Image Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 21:14 UTC · model grok-4.3

classification 💻 cs.CV
keywords: image generation · search-augmented agents · multi-hop reasoning · reinforcement learning · knowledge-grounded generation · agentic AI · Gen-Searcher

The pith

Gen-Searcher trains image generators to run multi-hop searches for external knowledge and reference images before synthesis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Gen-Searcher as the first trained agent that augments image generation with explicit search steps. Current models rely on fixed internal knowledge and fail when prompts need current facts or specific visual references. Gen-Searcher instead reasons over multiple search hops, gathers text and images, and conditions generation on that material. The paper supports this by building two datasets for supervised fine-tuning and reinforcement learning, introducing the KnowGen benchmark, and applying GRPO training with combined text and image rewards. The approach yields measured gains of roughly 16 points on KnowGen and 15 points on WISE when applied to a base model such as Qwen-Image.

Core claim

Gen-Searcher performs multi-hop reasoning and search to collect the textual knowledge and reference images needed for grounded generation. It is trained first with supervised fine-tuning on Gen-Searcher-SFT-10k and then with agentic reinforcement learning on Gen-Searcher-RL-6k using dual text-based and image-based rewards, producing substantial improvements on knowledge-intensive image generation tasks.

What carries the argument

Agentic search loop that interleaves reasoning, tool calls for text and image retrieval, and conditioned generation, optimized end-to-end by GRPO with dual rewards.
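
A rough sketch of that loop follows; it is not the paper's implementation. The tool names `search` and `image_search` follow the paper's tool descriptions, while the `policy`, `tools`, and `generator` interfaces below are hypothetical stand-ins.

```python
from dataclasses import dataclass, field

# Minimal sketch of the search-before-generate loop described above.
# Only the tool names `search` and `image_search` come from the paper;
# everything else is a hypothetical interface, not the released code.

@dataclass
class Evidence:
    texts: list = field(default_factory=list)    # retrieved snippets / factual notes
    images: list = field(default_factory=list)   # retrieved reference images

def agentic_generate(prompt, policy, tools, generator, max_hops=4):
    """Interleave reasoning and retrieval, then condition generation on the evidence."""
    context = [("user", prompt)]
    evidence = Evidence()

    for _ in range(max_hops):
        step = policy.next_action(context)           # reasoning text plus a proposed tool call
        if step.action == "search":                  # web text search for facts
            hits = tools.search(step.query, top_k=5)
            evidence.texts.extend(hits)
        elif step.action == "image_search":          # retrieve visual references
            hits = tools.image_search(step.query, top_k=3)
            evidence.images.extend(hits)
        else:                                        # agent decides it has enough evidence
            break
        context.append(("tool", hits))               # feed results back for the next hop

    # Condition the image generator on the prompt plus the gathered knowledge.
    return generator(prompt=prompt,
                     knowledge=evidence.texts,
                     reference_images=evidence.images)
```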

If this is right

  • Image generators can be extended beyond their training cutoff by retrieving fresh external information at inference time.
  • Dual text and image rewards supply more stable signals for training search-augmented agents than single-modality rewards.
  • New benchmarks that explicitly test search-grounded generation, such as KnowGen, become necessary to measure progress on knowledge-intensive prompts.
  • Open release of the SFT and RL datasets enables further work on search agents for vision tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same search-before-generate loop could be applied to video or 3D generation where up-to-date references are equally scarce.
  • If search quality improves, the method may reduce factual hallucinations in generated images more effectively than prompt engineering alone.
  • Real-time web search integration would turn the agent into a live knowledge system rather than one limited to static corpora.

Load-bearing premise

Multi-hop search will reliably return accurate, relevant textual knowledge and reference images that improve generation without injecting new errors or noise.

What would settle it

A head-to-head run of the same prompts through the base model and Gen-Searcher on KnowGen: if the search-augmented outputs show more factual mistakes or lower visual fidelity, the load-bearing premise fails.
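
A minimal sketch of such a paired check, assuming hypothetical `base_generate`, `agent_generate`, and `kscore_judge` callables (the paper's actual evaluation harness is not specified here):

```python
# Hedged sketch of the head-to-head check described above. The callables and
# example fields are assumptions, not the paper's evaluation code.

def head_to_head(benchmark, base_generate, agent_generate, kscore_judge):
    """Return mean K-Score for the base model and the search-augmented agent."""
    base_scores, agent_scores = [], []
    for example in benchmark:                        # each example: prompt + ground-truth reference
        base_img = base_generate(example.prompt)
        agent_img = agent_generate(example.prompt)   # search-augmented generation
        base_scores.append(kscore_judge(example.prompt, base_img, example.reference))
        agent_scores.append(kscore_judge(example.prompt, agent_img, example.reference))
    n = len(base_scores)
    return sum(base_scores) / n, sum(agent_scores) / n
```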

Figures

Figures reproduced from arXiv: 2603.28767 by Chenyang Wang, Dian Zheng, Hongyu Li, Kaituo Feng, Kaixuan Fan, Manyuan Zhang, Shuang Chen, Xiangyu Yue, Yilei Jiang, Yunlong Lin.

Figure 1: Generated images using our proposed Gen-Searcher.
Figure 2: Our proposed Gen-Searcher enables search-grounded generation in real-world knowledge-intensive scenarios.
Figure 3: An illustration of our data curation pipeline.
Figure 4: Overview of the KnowGen benchmark. (Adjacent text introduces K-Score, a GPT-4.1-judged metric for search-grounded generation that follows the WISE benchmark protocol.)
Figure 5: An inference example of Gen-Searcher. (Adjacent text describes the agent's search tools, including search for web text snippets and image_search for reference images.)
Figure 6: Examples of generated images by different methods on our KnowGen benchmark.
Figure 7: Parameter analysis on α. (Adjacent text begins the conclusion, recapping the data pipeline, the two training datasets, and the KnowGen benchmark with K-Score.)
Original abstract

Recent image generation models have shown strong capabilities in generating high-fidelity and photorealistic images. However, they are fundamentally constrained by frozen internal knowledge, thus often failing on real-world scenarios that are knowledge-intensive or require up-to-date information. In this paper, we present Gen-Searcher, as the first attempt to train a search-augmented image generation agent, which performs multi-hop reasoning and search to collect the textual knowledge and reference images needed for grounded generation. To achieve this, we construct a tailored data pipeline and curate two high-quality datasets, Gen-Searcher-SFT-10k and Gen-Searcher-RL-6k, containing diverse search-intensive prompts and corresponding ground-truth synthesis images. We further introduce KnowGen, a comprehensive benchmark that explicitly requires search-grounded external knowledge for image generation and evaluates models from multiple dimensions. Based on these resources, we train Gen-Searcher with SFT followed by agentic reinforcement learning with dual reward feedback, which combines text-based and image-based rewards to provide more stable and informative learning signals for GRPO training. Experiments show that Gen-Searcher brings substantial gains, improving Qwen-Image by around 16 points on KnowGen and 15 points on WISE. We hope this work can serve as an open foundation for search agents in image generation, and we fully open-source our data, models, and code.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Gen-Searcher as the first search-augmented image generation agent that performs multi-hop reasoning and search to collect textual knowledge and reference images for grounded generation. It constructs two datasets (Gen-Searcher-SFT-10k and Gen-Searcher-RL-6k), introduces the KnowGen benchmark, and trains via SFT followed by GRPO with dual text- and image-based rewards, claiming gains of ~16 points on KnowGen and ~15 points on WISE over Qwen-Image.

Significance. If the gains are shown to be robust and attributable to the agentic search rather than prompt engineering or noisy retrieval, the work would be significant as an open foundation for search-augmented image generation, providing datasets, models, and code that address the limitation of frozen knowledge in current generators.

major comments (3)
  1. [Data pipeline and dataset construction sections] The central claim of substantial gains rests on the assumption that multi-hop search returns accurate knowledge and references, yet the manuscript provides no quantitative audit (precision, recall, or error rate) of the retrieval pipeline used to build Gen-Searcher-SFT-10k and Gen-Searcher-RL-6k; without this, it is impossible to rule out that observed improvements reflect benchmark-specific noise rather than reliable grounding.
  2. [Experiments section] Experiments report numerical gains on KnowGen and WISE but supply no details on baselines, number of evaluation runs, error bars, ablation studies isolating the contribution of dual-reward GRPO versus SFT alone, or failure cases; this absence makes the ~16-point and ~15-point improvements unverifiable from the provided evidence.
  3. [Training methodology (GRPO with dual rewards)] The dual-reward formulation (text-based + image-based) under GRPO is presented as providing stable signals, but no analysis is given of how the rewards interact when search returns mismatched or outdated references, leaving open whether the RL stage actually corrects retrieval noise.
minor comments (2)
  1. [Abstract] The abstract states gains of 'around 16 points' and '15 points'; exact per-metric scores, standard deviations, and comparison tables should be added for reproducibility.
  2. [Method section] Notation for the dual reward function and GRPO objective should be defined explicitly with equations rather than prose descriptions.
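
To make minor comment 2 concrete, one plausible form of the dual reward and GRPO objective is sketched below. This is illustrative notation only, not the paper's; in particular, treating α as the text/image mixing weight (the parameter analyzed in Figure 7) is an assumption.

```latex
% Illustrative only: a plausible dual-reward GRPO formulation, not the paper's exact notation.
% r_i combines text- and image-based rewards for rollout i within a group of size G.
\begin{aligned}
r_i &= \alpha\, r_i^{\text{text}} + (1-\alpha)\, r_i^{\text{img}},\\[2pt]
A_i &= \frac{r_i - \operatorname{mean}\{r_j\}_{j=1}^{G}}{\operatorname{std}\{r_j\}_{j=1}^{G}},\\[2pt]
\mathcal{J}(\theta) &= \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}
  \min\!\Big(\rho_i(\theta)\,A_i,\ \operatorname{clip}\big(\rho_i(\theta),\,1-\epsilon,\,1+\epsilon\big)\,A_i\Big)\right],
\end{aligned}
```

where ρ_i(θ) denotes the ratio of the current to the behavior policy probability for rollout i.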

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below. Where the concerns identify gaps in the current manuscript, we commit to revisions that strengthen the evidence without altering the core claims or methodology.

Point-by-point responses
  1. Referee: [Data pipeline and dataset construction sections] The central claim of substantial gains rests on the assumption that multi-hop search returns accurate knowledge and references, yet the manuscript provides no quantitative audit (precision, recall, or error rate) of the retrieval pipeline used to build Gen-Searcher-SFT-10k and Gen-Searcher-RL-6k; without this, it is impossible to rule out that observed improvements reflect benchmark-specific noise rather than reliable grounding.

    Authors: We agree that an explicit quantitative audit of the retrieval pipeline would strengthen the grounding claims. In the revised manuscript we will add a dedicated subsection under Data Pipeline that reports precision and recall (computed via manual annotation of a 500-example stratified sample) together with error-rate statistics for both textual knowledge and reference-image retrieval. These metrics will be broken down by hop depth and query type. The downstream gains on KnowGen (a benchmark explicitly constructed to require external knowledge) already provide indirect validation, but the new audit will directly address the concern that improvements could stem from benchmark-specific noise. revision: yes

  2. Referee: [Experiments section] Experiments report numerical gains on KnowGen and WISE but supply no details on baselines, number of evaluation runs, error bars, ablation studies isolating the contribution of dual-reward GRPO versus SFT alone, or failure cases; this absence makes the ~16-point and ~15-point improvements unverifiable from the provided evidence.

    Authors: We acknowledge that the current Experiments section lacks the requested statistical and ablation details. In the revision we will: (i) enumerate all baselines with exact model versions and prompting setups, (ii) report results averaged over three independent evaluation runs with standard-error bars, (iii) add ablations that isolate SFT-only, single-reward GRPO, and dual-reward GRPO, and (iv) include a qualitative failure-case analysis with representative examples. These additions will make the reported gains fully verifiable and will quantify the incremental benefit of the agentic RL stage. revision: yes

  3. Referee: [Training methodology (GRPO with dual rewards)] The dual-reward formulation (text-based + image-based) under GRPO is presented as providing stable signals, but no analysis is given of how the rewards interact when search returns mismatched or outdated references, leaving open whether the RL stage actually corrects retrieval noise.

    Authors: We appreciate the referee highlighting the missing interaction analysis. The manuscript currently motivates the dual-reward design for stability but does not examine behavior under noisy retrieval. In the revised version we will insert a new subsection that (a) plots per-step text and image reward trajectories on a held-out noisy-retrieval subset, (b) provides case studies where mismatched references occur, and (c) shows that the combined reward still yields net positive policy updates by down-weighting unreliable text signals in favor of image-based feedback. This analysis will directly demonstrate the RL stage’s capacity to mitigate retrieval noise. revision: yes
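
To make the trajectory analysis promised in response 3 concrete, a minimal sketch of logging mean per-step text and image rewards over a set of rollouts follows; the `text_reward` and `image_reward` fields and the mixing weight are assumptions, not the paper's interfaces.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hedged sketch of the per-step reward-trajectory plot promised in response 3.
# `rollouts` is assumed to be a list of rollouts, each a list of step records
# carrying hypothetical `text_reward` and `image_reward` fields.

def plot_reward_trajectories(rollouts, alpha=0.5):
    """Plot mean text, image, and combined rewards per agent step across rollouts."""
    max_len = max(len(r) for r in rollouts)
    text = np.full((len(rollouts), max_len), np.nan)
    image = np.full((len(rollouts), max_len), np.nan)
    for i, rollout in enumerate(rollouts):
        for t, step in enumerate(rollout):
            text[i, t] = step.text_reward
            image[i, t] = step.image_reward

    steps = np.arange(max_len)
    mean_text = np.nanmean(text, axis=0)              # ignores rollouts shorter than max_len
    mean_image = np.nanmean(image, axis=0)
    combined = alpha * mean_text + (1 - alpha) * mean_image   # assumed mixing weight

    plt.plot(steps, mean_text, label="text reward")
    plt.plot(steps, mean_image, label="image reward")
    plt.plot(steps, combined, label="combined (assumed weighting)")
    plt.xlabel("agent step")
    plt.ylabel("mean reward")
    plt.legend()
    plt.show()
```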

Circularity Check

0 steps flagged

No circularity detected in derivation chain

full rationale

The paper constructs datasets via an external multi-hop search pipeline, introduces the independent KnowGen benchmark, and trains using SFT followed by GRPO with dual text-based and image-based rewards. These elements rely on external search results and reward signals that are not defined in terms of the final benchmark scores. No self-definitional reductions, fitted parameters renamed as predictions, or load-bearing self-citations appear in the described chain. Reported gains on KnowGen and WISE are empirical outcomes rather than forced by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are described in the abstract; the approach uses standard SFT, GRPO-style RL, and external search without new postulated components.

pith-pipeline@v0.9.0 · 5580 in / 1129 out tokens · 39061 ms · 2026-05-14T21:14:52.265561+00:00 · methodology

discussion (0)


Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. From Web to Pixels: Bringing Agentic Search into Visual Perception

    cs.CV 2026-05 unverdicted novelty 7.0

    WebEye benchmark and Pixel-Searcher agent enable visual perception tasks by using web search to resolve object identities before precise localization or answering.

  2. Flow-OPD: On-Policy Distillation for Flow Matching Models

    cs.CV 2026-05 conditional novelty 7.0

    Flow-OPD applies on-policy distillation to flow matching models via specialized teachers, cold-start initialization, and manifold anchor regularization, lifting GenEval from 63 to 92 and OCR from 59 to 94 on Stable Di...

  3. Flow-OPD: On-Policy Distillation for Flow Matching Models

    cs.CV 2026-05 unverdicted novelty 6.0

    Flow-OPD applies on-policy distillation to flow-matching text-to-image models, lifting GenEval from 63 to 92 and OCR accuracy from 59 to 94 while preserving fidelity.

  4. Flow-OPD: On-Policy Distillation for Flow Matching Models

    cs.CV 2026-05 unverdicted novelty 6.0

    Flow-OPD applies on-policy distillation to flow matching models, achieving GenEval of 92 and OCR accuracy of 94 on Stable Diffusion 3.5 Medium while avoiding the seesaw effect of multi-reward optimization.

  5. SCOPE: Structured Decomposition and Conditional Skill Orchestration for Complex Image Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    SCOPE maintains semantic commitments via structured specifications and conditional skill orchestration, achieving 0.60 EGIP on the new Gen-Arena benchmark while outperforming baselines on WISE-V and MindBench.

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · cited by 3 Pith papers · 15 internal anchors

  1. [1]

    Qwen-Image Technical Report

    Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report.arXiv preprint arXiv:2508.02324, 2025

  2. [2]

    Gemini image pro: High-quality image generation

    Google DeepMind. Gemini image pro: High-quality image generation. https://deepmind.google/models/gemini-image/pro/, 2025

  3. [3]

    Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

    Huanqia Cai, Sihan Cao, Ruoyi Du, Peng Gao, Steven Hoi, Zhaohui Hou, Shijie Huang, Dengyang Jiang, Xin Jin, Liangchen Li, et al. Z-image: An efficient image generation foundation model with single-stream diffusion transformer.arXiv preprint arXiv:2511.22699, 2025

  4. [4]

    Re-imagen: Retrieval-augmented text-to-image generator

    Wenhu Chen, Hexiang Hu, Chitwan Saharia, and William W Cohen. Re-imagen: Retrieval-augmented text-to-image generator. arXiv preprint arXiv:2209.14491, 2022

  5. [5]

    Retrieval-augmented diffusion models

    Andreas Blattmann, Robin Rombach, Kaan Oktay, Jonas Müller, and Björn Ommer. Retrieval-augmented diffusion models. Advances in Neural Information Processing Systems, 35:15309–15324, 2022

  6. [6]

    M2io-r1: An efficient rl-enhanced reasoning framework for multimodal retrieval augmented multimodal generation.arXiv preprint arXiv:2508.06328, 2025

    Zhiyou Xiao, Qinhan Yu, Binghui Li, Geng Chen, Chong Chen, and Wentao Zhang. M2io-r1: An efficient rl-enhanced reasoning framework for multimodal retrieval augmented multimodal generation.arXiv preprint arXiv:2508.06328, 2025

  7. [7]

    Ia-t2i: Internet-augmented text-to-image generation.arXiv preprint arXiv:2505.15779, 2025

    Chuanhao Li, Jianwen Sun, Yukang Feng, Mingliang Zhai, Yifan Chang, and Kaipeng Zhang. Ia-t2i: Internet-augmented text-to-image generation.arXiv preprint arXiv:2505.15779, 2025

  8. [8]

    Mind-brush: Integrating agentic cognitive search and reasoning into image generation.arXiv preprint arXiv:2602.01756, 2026

    Jun He, Junyan Ye, Zilong Huang, Dongzhi Jiang, Chenjue Zhang, Leqi Zhu, Renrui Zhang, Xiang Zhang, and Weijia Li. Mind-brush: Integrating agentic cognitive search and reasoning into image generation.arXiv preprint arXiv:2602.01756, 2026

  9. [9]

    WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent

    Xinyu Geng, Peng Xia, Zhen Zhang, Xinyu Wang, Qiuchen Wang, Ruixue Ding, Chenxi Wang, Jialong Wu, Yida Zhao, Kuan Li, et al. Webwatcher: Breaking new frontier of vision-language deep research agent.arXiv preprint arXiv:2508.05748, 2025

  10. [10]

    Exploring Reasoning Reward Model for Agents

    Kaixuan Fan, Kaituo Feng, Manyuan Zhang, Tianshuo Peng, Zhixun Li, Yilei Jiang, Shuang Chen, Peng Pei, Xunliang Cai, and Xiangyu Yue. Exploring reasoning reward model for agents.arXiv preprint arXiv:2601.22154, 2026

  11. [11]

    Gemini 3 pro.https://deepmind.google/models/gemini/pro/, 2025

    Google DeepMind. Gemini 3 pro.https://deepmind.google/models/gemini/pro/, 2025

  12. [12]

    Seed1.8 model card: Towards generalized real-world agency

    Bytedance Seed. Seed1.8 model card: Towards generalized real-world agency. https://seed.bytedance.com/en/seed1_8, 2025

  13. [13]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

  14. [14]

    OneThinker: All-in-one Reasoning Model for Image and Video

    Kaituo Feng, Manyuan Zhang, Hongyu Li, Kaixuan Fan, Shuang Chen, Yilei Jiang, Dian Zheng, Peiwen Sun, Yiyuan Zhang, Haoze Sun, et al. Onethinker: All-in-one reasoning model for image and video.arXiv preprint arXiv:2512.03043, 2025

  15. [15]

    Seedream 4.5.https://seed.bytedance.com/en/seedream4_5, 2025

    Bytedance Seed. Seedream 4.5.https://seed.bytedance.com/en/seedream4_5, 2025

  16. [16]

    WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation

    Yuwei Niu, Munan Ning, Mengren Zheng, Weiyang Jin, Bin Lin, Peng Jin, Jiaqi Liao, Chaoran Feng, Kunpeng Ning, Bin Zhu, et al. Wise: A world knowledge-informed semantic evaluation for text-to-image generation.arXiv preprint arXiv:2503.07265, 2025

  17. [17]

    Editthinker: Unlocking iterative reasoning for any image editor.arXiv preprint arXiv:2512.05965, 2025

    Hongyu Li, Manyuan Zhang, Dian Zheng, Ziyu Guo, Yimeng Jia, Kaituo Feng, Hao Yu, Yexin Liu, Yan Feng, Peng Pei, et al. Editthinker: Unlocking iterative reasoning for any image editor.arXiv preprint arXiv:2512.05965, 2025

  18. [18]

    AIA: Rethinking Architecture Decoupling Strategy In Unified Multimodal Model

    Dian Zheng, Manyuan Zhang, Hongyu Li, Kai Zou, Hongbo Liu, Ziyu Guo, Kaituo Feng, Yexin Liu, Ying Luo, Yan Feng, et al. Architecture decoupling is not all you need for unified multimodal model.arXiv preprint arXiv:2511.22663, 2025

  19. [19]

    Comprehensive exploration of diffusion models in image generation: a survey.Artificial Intelligence Review, 58(4):99, 2025

    Hang Chen, Qian Xiang, Jiaxin Hu, Meilin Ye, Chao Yu, Hao Cheng, and Lei Zhang. Comprehensive exploration of diffusion models in image generation: a survey.Artificial Intelligence Review, 58(4):99, 2025

  20. [20]

    Stable diffusion 3.5 large

    Stability AI. Stable diffusion 3.5 large. https://huggingface.co/stabilityai/stable-diffusion-3.5-large, 2024

  21. [21]

    Imagen.https://deepmind.google/models/imagen/, 2025

    Google DeepMind. Imagen.https://deepmind.google/models/imagen/, 2025

  22. [22]

    Flux 1.https://github.com/black-forest-labs/flux, 2024

    black-forest labs. Flux 1.https://github.com/black-forest-labs/flux, 2024

  23. [23]

    Longcat-image technical report

    Meituan LongCat Team, Hanghang Ma, Haoxian Tan, Jiale Huang, Junqiang Wu, Jun-Yan He, Lishuai Gao, Songlin Xiao, Xiaoming Wei, Xiaoqi Ma, et al. Longcat-image technical report.arXiv preprint arXiv:2512.07584, 2025

  24. [24]

    Agentic reinforced policy optimization

    Guanting Dong, Hangyu Mao, Kai Ma, Licheng Bao, Yifei Chen, Zhongyuan Wang, Zhongxia Chen, Jiazhen Du, Huiyang Wang, Fuzheng Zhang, et al. Agentic reinforced policy optimization.arXiv preprint arXiv:2507.19849, 2025

  25. [25]

    AdaTooler-V: Adaptive Tool-Use for Images and Videos

    Chaoyang Wang, Kaituo Feng, Dongyang Chen, Zhongyu Wang, Zhixun Li, Sicheng Gao, Meng Meng, Xu Zhou, Manyuan Zhang, Yuzhang Shang, et al. Adatooler-v: Adaptive tool-use for images and videos.arXiv preprint arXiv:2512.16918, 2025

  26. [26]

    Reinforcing spatial reasoning in vision-language models with interwoven thinking and visual drawing.ArXiv, abs/2506.09965, 2025

    Junfei Wu, Jian Guan, Kaituo Feng, Qiang Liu, Shu Wu, Liang Wang, Wei Wu, and Tieniu Tan. Reinforcing spatial reasoning in vision-language models with interwoven thinking and visual drawing.arXiv preprint arXiv:2506.09965, 2025

  27. [27]

    Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback

    Xiaoying Zhang, Yipeng Zhang, Hao Sun, Kaituo Feng, Chaochao Lu, Chao Yang, and Helen Meng. Critique-grpo: Advancing llm reasoning with natural language and numerical feedback.arXiv preprint arXiv:2506.03106, 2025

  28. [28]

    Video-R1: Reinforcing Video Reasoning in MLLMs

    Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in mllms.arXiv preprint arXiv:2503.21776, 2025

  29. [29]

    Advancing multimodal reasoning: From optimized cold start to staged reinforcement learning.arXiv preprint arXiv:2506.04207, 2025

    Shuang Chen, Yue Guo, Zhaochen Su, Yafu Li, Yulun Wu, Jiacheng Chen, Jiayu Chen, Weijie Wang, Xiaoye Qu, and Yu Cheng. Advancing multimodal reasoning: From optimized cold start to staged reinforcement learning.arXiv preprint arXiv:2506.04207, 2025

  30. [30]

    Sophiavl-r1: Reinforcing mllms reasoning with thinking reward.arXiv preprint arXiv:2505.17018, 2025

    Kaixuan Fan, Kaituo Feng, Haoming Lyu, Dongzhan Zhou, and Xiangyu Yue. Sophiavl-r1: Reinforcing mllms reasoning with thinking reward.arXiv preprint arXiv:2505.17018, 2025

  31. [31]

    Ares: Multimodal adaptive reasoning via difficulty-aware token-level entropy shaping.arXiv preprint arXiv:2510.08457, 2025

    Shuang Chen, Yue Guo, Yimeng Ye, Shijue Huang, Wenbo Hu, Haoxi Li, Manyuan Zhang, Jiayu Chen, Song Guo, and Nanyun Peng. Ares: Multimodal adaptive reasoning via difficulty-aware token-level entropy shaping.arXiv preprint arXiv:2510.08457, 2025

  32. [32]

    Group-in-Group Policy Optimization for LLM Agent Training

    Lang Feng, Zhenghai Xue, Tingcong Liu, and Bo An. Group-in-group policy optimization for llm agent training.arXiv preprint arXiv:2505.10978, 2025

  33. [33]

    Vision-deepresearch: Incentivizing deepresearch capability in multimodal large language models.arXiv preprint arXiv:2601.22060, 2026

    Wenxuan Huang, Yu Zeng, Qiuchen Wang, Zhen Fang, Shaosheng Cao, Zheng Chu, Qingyu Yin, Shuang Chen, Zhenfei Yin, Lin Chen, et al. Vision-deepresearch: Incentivizing deepresearch capability in multimodal large language models.arXiv preprint arXiv:2601.22060, 2026

  34. [34]

    Simpledeepsearcher: Deep information seeking via web-powered reasoning trajectory synthesis.arXiv preprint arXiv:2505.16834, 2025

    Shuang Sun, Huatong Song, Yuhao Wang, Ruiyang Ren, Jinhao Jiang, Junjie Zhang, Fei Bai, Jia Deng, Wayne Xin Zhao, Zheng Liu, et al. Simpledeepsearcher: Deep information seeking via web-powered reasoning trajectory synthesis.arXiv preprint arXiv:2505.16834, 2025

  35. [35]

    Introducing gpt-4.1 in the api.https://openai.com/index/gpt-4-1/, 2025

    OpenAI. Introducing gpt-4.1 in the api.https://openai.com/index/gpt-4-1/, 2025

  36. [36]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

  37. [37]

    Gpt-image-1: Models and capabilities for image generation

    OpenAI. Gpt-image-1: Models and capabilities for image generation. https://platform.openai.com/docs/models/gpt-image-1, 2024

  38. [38]

    Gpt-image-1.5: Enhanced visual reasoning and creative generation

    OpenAI. Gpt-image-1.5: Enhanced visual reasoning and creative generation. https://platform.openai.com/docs/models/gpt-image-1.5, 2025

  39. [39]

    Gemini image: High-quality image generation

    Google DeepMind. Gemini image: High-quality image generation. https://deepmind.google/models/gemini-image/flash/, 2025

  40. [40]

    Seedream 4.0: Toward Next-generation Multimodal Image Generation

    Team Seedream, Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, Zhiyao Guo, Xiaoxia Hou, Weilin Huang, Yixuan Huang, et al. Seedream 4.0: Toward next-generation multimodal image generation.arXiv preprint arXiv:2509.20427, 2025

  41. [41]

    Stable diffusion 3.5 medium

    Stability AI. Stable diffusion 3.5 medium. https://huggingface.co/stabilityai/stable-diffusion-3.5-medium, 2024

  42. [42]

    Lumina- image 2.0: A unified and efficient image generative framework

    Qi Qin, Le Zhuo, Yi Xin, Ruoyi Du, Zhen Li, Bin Fu, Yiting Lu, Xinyue Li, Dongyang Liu, Xiangyang Zhu, et al. Lumina- image 2.0: A unified and efficient image generative framework. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 20031–20042, 2025

  43. [43]

    Flux 2.https://github.com/black-forest-labs/flux2, 2025

    black-forest labs. Flux 2.https://github.com/black-forest-labs/flux2, 2025

  44. [44]

    Emerging Properties in Unified Multimodal Pretraining

    Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025

  45. [45]

    Hunyuanimage 3.0 technical report.arXiv preprint arXiv:2509.23951, 2025

    Siyu Cao, Hangting Chen, Peng Chen, Yiji Cheng, Yutao Cui, Xinchi Deng, Ying Dong, Kipper Gong, Tianpeng Gu, Xiusen Gu, et al. Hunyuanimage 3.0 technical report.arXiv preprint arXiv:2509.23951, 2025

  46. [46]

    Stable diffusion 3 medium

    Stability AI. Stable diffusion 3 medium. https://huggingface.co/stabilityai/stable-diffusion-3-medium, 2024

  47. [47]

    Emu3: Next-Token Prediction is All You Need

    Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869, 2024

  Entries [48]–[61] are not bibliographic references: they are fragments of the paper's appendix ("A KnowGen Benchmark Evaluation Prompt") spilled into the extracted reference list. The recoverable content: the K-Score judge receives the task prompt, the generated image (Image 1), and a ground-truth reference image (Image 2); it is instructed that this is not a pixel-level similarity task, extracts the prompt's top hard constraints, and scores faithfulness, visual_correctness, text_accuracy (with a text_accuracy_na flag and a fixed 0.5 score when no readable text is required), and aesthetics, each with a rationale. A second spilled prompt describes judging the agent's answer, which contains a gen_prompt and a list of reference_images selected from search to guide generation.