pith · machine review for the scientific record

arxiv: 2603.28767 · v2 · submitted 2026-03-30 · 💻 cs.CV

Recognition: unknown

Gen-Searcher: Reinforcing Agentic Search for Image Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 21:14 UTC · model grok-4.3

classification 💻 cs.CV
keywords: image generation · search-augmented agents · multi-hop reasoning · reinforcement learning · knowledge-grounded generation · agentic AI · Gen-Searcher

The pith

Gen-Searcher trains image generators to run multi-hop searches for external knowledge and reference images before synthesis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Gen-Searcher as the first trained agent that augments image generation with explicit search steps. Current models rely on fixed internal knowledge and fail when prompts need current facts or specific visual references. Gen-Searcher instead reasons over multiple search hops, gathers text and images, and conditions generation on that material. The paper supports this by building two datasets for supervised fine-tuning and reinforcement learning, introducing the KnowGen benchmark, and applying GRPO training with combined text and image rewards. The approach yields measured gains of roughly 16 points on KnowGen and 15 points on WISE when applied to a base model such as Qwen-Image.

Core claim

Gen-Searcher performs multi-hop reasoning and search to collect the textual knowledge and reference images needed for grounded generation. It is trained first with supervised fine-tuning on Gen-Searcher-SFT-10k and then with agentic reinforcement learning on Gen-Searcher-RL-6k using dual text-based and image-based rewards, producing substantial improvements on knowledge-intensive image generation tasks.

What carries the argument

Agentic search loop that interleaves reasoning, tool calls for text and image retrieval, and conditioned generation, optimized end-to-end by GRPO with dual rewards.
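
A rough sketch of that loop follows; it is not the paper's implementation. The tool names `search` and `image_search` follow the paper's tool descriptions, while the `policy`, `tools`, and `generator` interfaces below are hypothetical stand-ins.

```python
from dataclasses import dataclass, field

# Minimal sketch of the search-before-generate loop described above.
# Only the tool names `search` and `image_search` come from the paper;
# everything else is a hypothetical interface, not the released code.

@dataclass
class Evidence:
    texts: list = field(default_factory=list)    # retrieved snippets / factual notes
    images: list = field(default_factory=list)   # retrieved reference images

def agentic_generate(prompt, policy, tools, generator, max_hops=4):
    """Interleave reasoning and retrieval, then condition generation on the evidence."""
    context = [("user", prompt)]
    evidence = Evidence()

    for _ in range(max_hops):
        step = policy.next_action(context)           # reasoning text plus a proposed tool call
        if step.action == "search":                  # web text search for facts
            hits = tools.search(step.query, top_k=5)
            evidence.texts.extend(hits)
        elif step.action == "image_search":          # retrieve visual references
            hits = tools.image_search(step.query, top_k=3)
            evidence.images.extend(hits)
        else:                                        # agent decides it has enough evidence
            break
        context.append(("tool", hits))               # feed results back for the next hop

    # Condition the image generator on the prompt plus the gathered knowledge.
    return generator(prompt=prompt,
                     knowledge=evidence.texts,
                     reference_images=evidence.images)
```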

If this is right

  • Image generators can be extended beyond their training cutoff by retrieving fresh external information at inference time.
  • Dual text and image rewards supply more stable signals for training search-augmented agents than single-modality rewards.
  • New benchmarks that explicitly test search-grounded generation, such as KnowGen, become necessary to measure progress on knowledge-intensive prompts.
  • Open release of the SFT and RL datasets enables further work on search agents for vision tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same search-before-generate loop could be applied to video or 3D generation where up-to-date references are equally scarce.
  • If search quality improves, the method may reduce factual hallucinations in generated images more effectively than prompt engineering alone.
  • Real-time web search integration would turn the agent into a live knowledge system rather than one limited to static corpora.

Load-bearing premise

Multi-hop search will reliably return accurate, relevant textual knowledge and reference images that improve generation without injecting new errors or noise.

What would settle it

A head-to-head run of the same prompts through the base model and Gen-Searcher on KnowGen: if the search-augmented outputs show more factual mistakes or lower visual fidelity, the load-bearing premise fails.
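
A minimal sketch of such a paired check, assuming hypothetical `base_generate`, `agent_generate`, and `kscore_judge` callables (the paper's actual evaluation harness is not specified here):

```python
# Hedged sketch of the head-to-head check described above. The callables and
# example fields are assumptions, not the paper's evaluation code.

def head_to_head(benchmark, base_generate, agent_generate, kscore_judge):
    """Return mean K-Score for the base model and the search-augmented agent."""
    base_scores, agent_scores = [], []
    for example in benchmark:                        # each example: prompt + ground-truth reference
        base_img = base_generate(example.prompt)
        agent_img = agent_generate(example.prompt)   # search-augmented generation
        base_scores.append(kscore_judge(example.prompt, base_img, example.reference))
        agent_scores.append(kscore_judge(example.prompt, agent_img, example.reference))
    n = len(base_scores)
    return sum(base_scores) / n, sum(agent_scores) / n
```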

Figures

Figures reproduced from arXiv: 2603.28767 by Chenyang Wang, Dian Zheng, Hongyu Li, Kaituo Feng, Kaixuan Fan, Manyuan Zhang, Shuang Chen, Xiangyu Yue, Yilei Jiang, Yunlong Lin.

Figure 1: Generated images using our proposed Gen-Searcher.
Figure 2: Our proposed Gen-Searcher enables search-grounded generation in real-world knowledge-intensive scenarios.
Figure 3: An illustration of our data curation pipeline.
Figure 4: Overview of the KnowGen benchmark. (Adjacent text introduces K-Score, a GPT-4.1-judged metric for search-grounded generation that follows the WISE benchmark protocol.)
Figure 5: An inference example of Gen-Searcher. (Adjacent text describes the agent's search tools, including search for web text snippets and image_search for reference images.)
Figure 6: Examples of generated images by different methods on our KnowGen benchmark.
Figure 7: Parameter analysis on α. (Adjacent text begins the conclusion, recapping the data pipeline, the two training datasets, and the KnowGen benchmark with K-Score.)
Original abstract

Recent image generation models have shown strong capabilities in generating high-fidelity and photorealistic images. However, they are fundamentally constrained by frozen internal knowledge, thus often failing on real-world scenarios that are knowledge-intensive or require up-to-date information. In this paper, we present Gen-Searcher, as the first attempt to train a search-augmented image generation agent, which performs multi-hop reasoning and search to collect the textual knowledge and reference images needed for grounded generation. To achieve this, we construct a tailored data pipeline and curate two high-quality datasets, Gen-Searcher-SFT-10k and Gen-Searcher-RL-6k, containing diverse search-intensive prompts and corresponding ground-truth synthesis images. We further introduce KnowGen, a comprehensive benchmark that explicitly requires search-grounded external knowledge for image generation and evaluates models from multiple dimensions. Based on these resources, we train Gen-Searcher with SFT followed by agentic reinforcement learning with dual reward feedback, which combines text-based and image-based rewards to provide more stable and informative learning signals for GRPO training. Experiments show that Gen-Searcher brings substantial gains, improving Qwen-Image by around 16 points on KnowGen and 15 points on WISE. We hope this work can serve as an open foundation for search agents in image generation, and we fully open-source our data, models, and code.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Gen-Searcher as the first search-augmented image generation agent that performs multi-hop reasoning and search to collect textual knowledge and reference images for grounded generation. It constructs two datasets (Gen-Searcher-SFT-10k and Gen-Searcher-RL-6k), introduces the KnowGen benchmark, and trains via SFT followed by GRPO with dual text- and image-based rewards, claiming gains of ~16 points on KnowGen and ~15 points on WISE over Qwen-Image.

Significance. If the gains are shown to be robust and attributable to the agentic search rather than prompt engineering or noisy retrieval, the work would be significant as an open foundation for search-augmented image generation, providing datasets, models, and code that address the limitation of frozen knowledge in current generators.

major comments (3)
  1. [Data pipeline and dataset construction sections] The central claim of substantial gains rests on the assumption that multi-hop search returns accurate knowledge and references, yet the manuscript provides no quantitative audit (precision, recall, or error rate) of the retrieval pipeline used to build Gen-Searcher-SFT-10k and Gen-Searcher-RL-6k; without this, it is impossible to rule out that observed improvements reflect benchmark-specific noise rather than reliable grounding.
  2. [Experiments section] Experiments report numerical gains on KnowGen and WISE but supply no details on baselines, number of evaluation runs, error bars, ablation studies isolating the contribution of dual-reward GRPO versus SFT alone, or failure cases; this absence makes the ~16-point and ~15-point improvements unverifiable from the provided evidence.
  3. [Training methodology (GRPO with dual rewards)] The dual-reward formulation (text-based + image-based) under GRPO is presented as providing stable signals, but no analysis is given of how the rewards interact when search returns mismatched or outdated references, leaving open whether the RL stage actually corrects retrieval noise.
minor comments (2)
  1. [Abstract] The abstract states gains of 'around 16 points' and '15 points'; exact per-metric scores, standard deviations, and comparison tables should be added for reproducibility.
  2. [Method section] Notation for the dual reward function and GRPO objective should be defined explicitly with equations rather than prose descriptions.
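
To make minor comment 2 concrete, one plausible form of the dual reward and GRPO objective is sketched below. This is illustrative notation only, not the paper's; in particular, treating α as the text/image mixing weight (the parameter analyzed in Figure 7) is an assumption.

```latex
% Illustrative only: a plausible dual-reward GRPO formulation, not the paper's exact notation.
% r_i combines text- and image-based rewards for rollout i within a group of size G.
\begin{aligned}
r_i &= \alpha\, r_i^{\text{text}} + (1-\alpha)\, r_i^{\text{img}},\\[2pt]
A_i &= \frac{r_i - \operatorname{mean}\{r_j\}_{j=1}^{G}}{\operatorname{std}\{r_j\}_{j=1}^{G}},\\[2pt]
\mathcal{J}(\theta) &= \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}
  \min\!\Big(\rho_i(\theta)\,A_i,\ \operatorname{clip}\big(\rho_i(\theta),\,1-\epsilon,\,1+\epsilon\big)\,A_i\Big)\right],
\end{aligned}
```

where ρ_i(θ) denotes the ratio of the current to the behavior policy probability for rollout i.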

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below. Where the concerns identify gaps in the current manuscript, we commit to revisions that strengthen the evidence without altering the core claims or methodology.

Point-by-point responses
  1. Referee: [Data pipeline and dataset construction sections] The central claim of substantial gains rests on the assumption that multi-hop search returns accurate knowledge and references, yet the manuscript provides no quantitative audit (precision, recall, or error rate) of the retrieval pipeline used to build Gen-Searcher-SFT-10k and Gen-Searcher-RL-6k; without this, it is impossible to rule out that observed improvements reflect benchmark-specific noise rather than reliable grounding.

    Authors: We agree that an explicit quantitative audit of the retrieval pipeline would strengthen the grounding claims. In the revised manuscript we will add a dedicated subsection under Data Pipeline that reports precision and recall (computed via manual annotation of a 500-example stratified sample) together with error-rate statistics for both textual knowledge and reference-image retrieval. These metrics will be broken down by hop depth and query type. The downstream gains on KnowGen (a benchmark explicitly constructed to require external knowledge) already provide indirect validation, but the new audit will directly address the concern that improvements could stem from benchmark-specific noise. revision: yes

  2. Referee: [Experiments section] Experiments report numerical gains on KnowGen and WISE but supply no details on baselines, number of evaluation runs, error bars, ablation studies isolating the contribution of dual-reward GRPO versus SFT alone, or failure cases; this absence makes the ~16-point and ~15-point improvements unverifiable from the provided evidence.

    Authors: We acknowledge that the current Experiments section lacks the requested statistical and ablation details. In the revision we will: (i) enumerate all baselines with exact model versions and prompting setups, (ii) report results averaged over three independent evaluation runs with standard-error bars, (iii) add ablations that isolate SFT-only, single-reward GRPO, and dual-reward GRPO, and (iv) include a qualitative failure-case analysis with representative examples. These additions will make the reported gains fully verifiable and will quantify the incremental benefit of the agentic RL stage. revision: yes

  3. Referee: [Training methodology (GRPO with dual rewards)] The dual-reward formulation (text-based + image-based) under GRPO is presented as providing stable signals, but no analysis is given of how the rewards interact when search returns mismatched or outdated references, leaving open whether the RL stage actually corrects retrieval noise.

    Authors: We appreciate the referee highlighting the missing interaction analysis. The manuscript currently motivates the dual-reward design for stability but does not examine behavior under noisy retrieval. In the revised version we will insert a new subsection that (a) plots per-step text and image reward trajectories on a held-out noisy-retrieval subset, (b) provides case studies where mismatched references occur, and (c) shows that the combined reward still yields net positive policy updates by down-weighting unreliable text signals in favor of image-based feedback. This analysis will directly demonstrate the RL stage’s capacity to mitigate retrieval noise. revision: yes
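
To make the trajectory analysis promised in response 3 concrete, a minimal sketch of logging mean per-step text and image rewards over a set of rollouts follows; the `text_reward` and `image_reward` fields and the mixing weight are assumptions, not the paper's interfaces.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hedged sketch of the per-step reward-trajectory plot promised in response 3.
# `rollouts` is assumed to be a list of rollouts, each a list of step records
# carrying hypothetical `text_reward` and `image_reward` fields.

def plot_reward_trajectories(rollouts, alpha=0.5):
    """Plot mean text, image, and combined rewards per agent step across rollouts."""
    max_len = max(len(r) for r in rollouts)
    text = np.full((len(rollouts), max_len), np.nan)
    image = np.full((len(rollouts), max_len), np.nan)
    for i, rollout in enumerate(rollouts):
        for t, step in enumerate(rollout):
            text[i, t] = step.text_reward
            image[i, t] = step.image_reward

    steps = np.arange(max_len)
    mean_text = np.nanmean(text, axis=0)              # ignores rollouts shorter than max_len
    mean_image = np.nanmean(image, axis=0)
    combined = alpha * mean_text + (1 - alpha) * mean_image   # assumed mixing weight

    plt.plot(steps, mean_text, label="text reward")
    plt.plot(steps, mean_image, label="image reward")
    plt.plot(steps, combined, label="combined (assumed weighting)")
    plt.xlabel("agent step")
    plt.ylabel("mean reward")
    plt.legend()
    plt.show()
```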

Circularity Check

0 steps flagged

No circularity detected in derivation chain

full rationale

The paper constructs datasets via an external multi-hop search pipeline, introduces the independent KnowGen benchmark, and trains using SFT followed by GRPO with dual text-based and image-based rewards. These elements rely on external search results and reward signals that are not defined in terms of the final benchmark scores. No self-definitional reductions, fitted parameters renamed as predictions, or load-bearing self-citations appear in the described chain. Reported gains on KnowGen and WISE are empirical outcomes rather than forced by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are described in the abstract; the approach uses standard SFT, GRPO-style RL, and external search without new postulated components.

pith-pipeline@v0.9.0 · 5580 in / 1129 out tokens · 39061 ms · 2026-05-14T21:14:52.265561+00:00 · methodology

discussion (0)


Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. From Web to Pixels: Bringing Agentic Search into Visual Perception

    cs.CV 2026-05 unverdicted novelty 7.0

    WebEye benchmark and Pixel-Searcher agent enable visual perception tasks by using web search to resolve object identities before precise localization or answering.

  2. Flow-OPD: On-Policy Distillation for Flow Matching Models

    cs.CV 2026-05 conditional novelty 7.0

    Flow-OPD applies on-policy distillation to flow matching models via specialized teachers, cold-start initialization, and manifold anchor regularization, lifting GenEval from 63 to 92 and OCR from 59 to 94 on Stable Di...

  3. Flow-OPD: On-Policy Distillation for Flow Matching Models

    cs.CV 2026-05 unverdicted novelty 6.0

    Flow-OPD applies on-policy distillation to flow-matching text-to-image models, lifting GenEval from 63 to 92 and OCR accuracy from 59 to 94 while preserving fidelity.

  4. Flow-OPD: On-Policy Distillation for Flow Matching Models

    cs.CV 2026-05 unverdicted novelty 6.0

    Flow-OPD applies on-policy distillation to flow matching models, achieving GenEval of 92 and OCR accuracy of 94 on Stable Diffusion 3.5 Medium while avoiding the seesaw effect of multi-reward optimization.

  5. SCOPE: Structured Decomposition and Conditional Skill Orchestration for Complex Image Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    SCOPE maintains semantic commitments via structured specifications and conditional skill orchestration, achieving 0.60 EGIP on the new Gen-Arena benchmark while outperforming baselines on WISE-V and MindBench.

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · cited by 3 Pith papers · 15 internal anchors

  1. [1]

    Qwen-Image Technical Report

    Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report.arXiv preprint arXiv:2508.02324, 2025

  2. [2]

    Gemini image pro: High-quality image generation

    Google DeepMind. Gemini image pro: High-quality image generation. https://deepmind.google/models/gemini-image/pro/, 2025

  3. [3]

    Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

    Huanqia Cai, Sihan Cao, Ruoyi Du, Peng Gao, Steven Hoi, Zhaohui Hou, Shijie Huang, Dengyang Jiang, Xin Jin, Liangchen Li, et al. Z-image: An efficient image generation foundation model with single-stream diffusion transformer.arXiv preprint arXiv:2511.22699, 2025

  4. [4]

    Re-imagen: Retrieval-augmented text-to-image generator

    Wenhu Chen, Hexiang Hu, Chitwan Saharia, and William W Cohen. Re-imagen: Retrieval-augmented text-to-image generator. arXiv preprint arXiv:2209.14491, 2022

  5. [5]

    Retrieval-augmented diffusion models

    Andreas Blattmann, Robin Rombach, Kaan Oktay, Jonas Müller, and Björn Ommer. Retrieval-augmented diffusion models. Advances in Neural Information Processing Systems, 35:15309–15324, 2022

  6. [6]

    M2io-r1: An efficient rl-enhanced reasoning framework for multimodal retrieval augmented multimodal generation.arXiv preprint arXiv:2508.06328, 2025

    Zhiyou Xiao, Qinhan Yu, Binghui Li, Geng Chen, Chong Chen, and Wentao Zhang. M2io-r1: An efficient rl-enhanced reasoning framework for multimodal retrieval augmented multimodal generation.arXiv preprint arXiv:2508.06328, 2025

  7. [7]

    Ia-t2i: Internet-augmented text-to-image generation.arXiv preprint arXiv:2505.15779, 2025

    Chuanhao Li, Jianwen Sun, Yukang Feng, Mingliang Zhai, Yifan Chang, and Kaipeng Zhang. Ia-t2i: Internet-augmented text-to-image generation.arXiv preprint arXiv:2505.15779, 2025

  8. [8]

    Mind-brush: Integrating agentic cognitive search and reasoning into image generation.arXiv preprint arXiv:2602.01756, 2026

    Jun He, Junyan Ye, Zilong Huang, Dongzhi Jiang, Chenjue Zhang, Leqi Zhu, Renrui Zhang, Xiang Zhang, and Weijia Li. Mind-brush: Integrating agentic cognitive search and reasoning into image generation.arXiv preprint arXiv:2602.01756, 2026

  9. [9]

    WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent

    Xinyu Geng, Peng Xia, Zhen Zhang, Xinyu Wang, Qiuchen Wang, Ruixue Ding, Chenxi Wang, Jialong Wu, Yida Zhao, Kuan Li, et al. Webwatcher: Breaking new frontier of vision-language deep research agent.arXiv preprint arXiv:2508.05748, 2025

  10. [10]

    Exploring Reasoning Reward Model for Agents

    Kaixuan Fan, Kaituo Feng, Manyuan Zhang, Tianshuo Peng, Zhixun Li, Yilei Jiang, Shuang Chen, Peng Pei, Xunliang Cai, and Xiangyu Yue. Exploring reasoning reward model for agents.arXiv preprint arXiv:2601.22154, 2026

  11. [11]

    Gemini 3 pro.https://deepmind.google/models/gemini/pro/, 2025

    Google DeepMind. Gemini 3 pro.https://deepmind.google/models/gemini/pro/, 2025

  12. [12]

    Seed1.8 model card: Towards generalized real-world agency

    Bytedance Seed. Seed1.8 model card: Towards generalized real-world agency. https://seed.bytedance.com/en/seed1_8, 2025

  13. [13]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

  14. [14]

    OneThinker: All-in-one Reasoning Model for Image and Video

    Kaituo Feng, Manyuan Zhang, Hongyu Li, Kaixuan Fan, Shuang Chen, Yilei Jiang, Dian Zheng, Peiwen Sun, Yiyuan Zhang, Haoze Sun, et al. Onethinker: All-in-one reasoning model for image and video.arXiv preprint arXiv:2512.03043, 2025

  15. [15]

    Seedream 4.5.https://seed.bytedance.com/en/seedream4_5, 2025

    Bytedance Seed. Seedream 4.5.https://seed.bytedance.com/en/seedream4_5, 2025

  16. [16]

    WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation

    Yuwei Niu, Munan Ning, Mengren Zheng, Weiyang Jin, Bin Lin, Peng Jin, Jiaqi Liao, Chaoran Feng, Kunpeng Ning, Bin Zhu, et al. Wise: A world knowledge-informed semantic evaluation for text-to-image generation.arXiv preprint arXiv:2503.07265, 2025

  17. [17]

    Editthinker: Unlocking iterative reasoning for any image editor.arXiv preprint arXiv:2512.05965, 2025

    Hongyu Li, Manyuan Zhang, Dian Zheng, Ziyu Guo, Yimeng Jia, Kaituo Feng, Hao Yu, Yexin Liu, Yan Feng, Peng Pei, et al. Editthinker: Unlocking iterative reasoning for any image editor.arXiv preprint arXiv:2512.05965, 2025

  18. [18]

    AIA: Rethinking Architecture Decoupling Strategy In Unified Multimodal Model

    Dian Zheng, Manyuan Zhang, Hongyu Li, Kai Zou, Hongbo Liu, Ziyu Guo, Kaituo Feng, Yexin Liu, Ying Luo, Yan Feng, et al. Architecture decoupling is not all you need for unified multimodal model.arXiv preprint arXiv:2511.22663, 2025

  19. [19]

    Comprehensive exploration of diffusion models in image generation: a survey.Artificial Intelligence Review, 58(4):99, 2025

    Hang Chen, Qian Xiang, Jiaxin Hu, Meilin Ye, Chao Yu, Hao Cheng, and Lei Zhang. Comprehensive exploration of diffusion models in image generation: a survey.Artificial Intelligence Review, 58(4):99, 2025

  20. [20]

    Stable diffusion 3.5 large

    Stability AI. Stable diffusion 3.5 large. https://huggingface.co/stabilityai/stable-diffusion-3.5-large, 2024

  21. [21]

    Imagen.https://deepmind.google/models/imagen/, 2025

    Google DeepMind. Imagen.https://deepmind.google/models/imagen/, 2025

  22. [22]

    Flux 1.https://github.com/black-forest-labs/flux, 2024

    black-forest labs. Flux 1.https://github.com/black-forest-labs/flux, 2024

  23. [23]

    Longcat-image technical report

    Meituan LongCat Team, Hanghang Ma, Haoxian Tan, Jiale Huang, Junqiang Wu, Jun-Yan He, Lishuai Gao, Songlin Xiao, Xiaoming Wei, Xiaoqi Ma, et al. Longcat-image technical report.arXiv preprint arXiv:2512.07584, 2025

  24. [24]

    Agentic reinforced policy optimization

    Guanting Dong, Hangyu Mao, Kai Ma, Licheng Bao, Yifei Chen, Zhongyuan Wang, Zhongxia Chen, Jiazhen Du, Huiyang Wang, Fuzheng Zhang, et al. Agentic reinforced policy optimization.arXiv preprint arXiv:2507.19849, 2025

  25. [25]

    AdaTooler-V: Adaptive Tool-Use for Images and Videos

    Chaoyang Wang, Kaituo Feng, Dongyang Chen, Zhongyu Wang, Zhixun Li, Sicheng Gao, Meng Meng, Xu Zhou, Manyuan Zhang, Yuzhang Shang, et al. Adatooler-v: Adaptive tool-use for images and videos.arXiv preprint arXiv:2512.16918, 2025

  26. [26]

    Reinforcing spatial reasoning in vision-language models with interwoven thinking and visual drawing.ArXiv, abs/2506.09965, 2025

    Junfei Wu, Jian Guan, Kaituo Feng, Qiang Liu, Shu Wu, Liang Wang, Wei Wu, and Tieniu Tan. Reinforcing spatial reasoning in vision-language models with interwoven thinking and visual drawing.arXiv preprint arXiv:2506.09965, 2025

  27. [27]

    Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback

    Xiaoying Zhang, Yipeng Zhang, Hao Sun, Kaituo Feng, Chaochao Lu, Chao Yang, and Helen Meng. Critique-grpo: Advancing llm reasoning with natural language and numerical feedback.arXiv preprint arXiv:2506.03106, 2025

  28. [28]

    Video-R1: Reinforcing Video Reasoning in MLLMs

    Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in mllms.arXiv preprint arXiv:2503.21776, 2025

  29. [29]

    Advancing multimodal reasoning: From optimized cold start to staged reinforcement learning.arXiv preprint arXiv:2506.04207, 2025

    Shuang Chen, Yue Guo, Zhaochen Su, Yafu Li, Yulun Wu, Jiacheng Chen, Jiayu Chen, Weijie Wang, Xiaoye Qu, and Yu Cheng. Advancing multimodal reasoning: From optimized cold start to staged reinforcement learning.arXiv preprint arXiv:2506.04207, 2025

  30. [30]

    Sophiavl-r1: Reinforcing mllms reasoning with thinking reward.arXiv preprint arXiv:2505.17018, 2025

    Kaixuan Fan, Kaituo Feng, Haoming Lyu, Dongzhan Zhou, and Xiangyu Yue. Sophiavl-r1: Reinforcing mllms reasoning with thinking reward.arXiv preprint arXiv:2505.17018, 2025

  31. [31]

    Ares: Multimodal adaptive reasoning via difficulty-aware token-level entropy shaping.arXiv preprint arXiv:2510.08457, 2025

    Shuang Chen, Yue Guo, Yimeng Ye, Shijue Huang, Wenbo Hu, Haoxi Li, Manyuan Zhang, Jiayu Chen, Song Guo, and Nanyun Peng. Ares: Multimodal adaptive reasoning via difficulty-aware token-level entropy shaping.arXiv preprint arXiv:2510.08457, 2025

  32. [32]

    Group-in-Group Policy Optimization for LLM Agent Training

    Lang Feng, Zhenghai Xue, Tingcong Liu, and Bo An. Group-in-group policy optimization for llm agent training.arXiv preprint arXiv:2505.10978, 2025

  33. [33]

    Vision-deepresearch: Incentivizing deepresearch capability in multimodal large language models.arXiv preprint arXiv:2601.22060, 2026

    Wenxuan Huang, Yu Zeng, Qiuchen Wang, Zhen Fang, Shaosheng Cao, Zheng Chu, Qingyu Yin, Shuang Chen, Zhenfei Yin, Lin Chen, et al. Vision-deepresearch: Incentivizing deepresearch capability in multimodal large language models.arXiv preprint arXiv:2601.22060, 2026

  34. [34]

    Simpledeepsearcher: Deep information seeking via web-powered reasoning trajectory synthesis.arXiv preprint arXiv:2505.16834, 2025

    Shuang Sun, Huatong Song, Yuhao Wang, Ruiyang Ren, Jinhao Jiang, Junjie Zhang, Fei Bai, Jia Deng, Wayne Xin Zhao, Zheng Liu, et al. Simpledeepsearcher: Deep information seeking via web-powered reasoning trajectory synthesis.arXiv preprint arXiv:2505.16834, 2025

  35. [35]

    Introducing gpt-4.1 in the api.https://openai.com/index/gpt-4-1/, 2025

    OpenAI. Introducing gpt-4.1 in the api.https://openai.com/index/gpt-4-1/, 2025

  36. [36]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

  37. [37]

    Gpt-image-1: Models and capabilities for image generation

    OpenAI. Gpt-image-1: Models and capabilities for image generation. https://platform.openai.com/docs/models/gpt-image-1, 2024

  38. [38]

    Gpt-image-1.5: Enhanced visual reasoning and creative generation

    OpenAI. Gpt-image-1.5: Enhanced visual reasoning and creative generation. https://platform.openai.com/docs/models/gpt-image-1.5, 2025

  39. [39]

    Gemini image: High-quality image generation

    Google DeepMind. Gemini image: High-quality image generation. https://deepmind.google/models/gemini-image/flash/, 2025

  40. [40]

    Seedream 4.0: Toward Next-generation Multimodal Image Generation

    Team Seedream, Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, Zhiyao Guo, Xiaoxia Hou, Weilin Huang, Yixuan Huang, et al. Seedream 4.0: Toward next-generation multimodal image generation.arXiv preprint arXiv:2509.20427, 2025

  41. [41]

    Stable diffusion 3.5 medium

    Stability AI. Stable diffusion 3.5 medium. https://huggingface.co/stabilityai/stable-diffusion-3.5-medium, 2024

  42. [42]

    Lumina- image 2.0: A unified and efficient image generative framework

    Qi Qin, Le Zhuo, Yi Xin, Ruoyi Du, Zhen Li, Bin Fu, Yiting Lu, Xinyue Li, Dongyang Liu, Xiangyang Zhu, et al. Lumina- image 2.0: A unified and efficient image generative framework. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 20031–20042, 2025

  43. [43]

    Flux 2.https://github.com/black-forest-labs/flux2, 2025

    black-forest labs. Flux 2.https://github.com/black-forest-labs/flux2, 2025

  44. [44]

    Emerging Properties in Unified Multimodal Pretraining

    Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025

  45. [45]

    Hunyuanimage 3.0 technical report.arXiv preprint arXiv:2509.23951, 2025

    Siyu Cao, Hangting Chen, Peng Chen, Yiji Cheng, Yutao Cui, Xinchi Deng, Ying Dong, Kipper Gong, Tianpeng Gu, Xiusen Gu, et al. Hunyuanimage 3.0 technical report.arXiv preprint arXiv:2509.23951, 2025

  46. [46]

    Stable diffusion 3 medium

    Stability AI. Stable diffusion 3 medium. https://huggingface.co/stabilityai/stable-diffusion-3-medium, 2024

  47. [47]

    Emu3: Next-Token Prediction is All You Need

    Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869, 2024

  Entries [48]–[61] are not bibliographic references: they are fragments of the paper's appendix ("A KnowGen Benchmark Evaluation Prompt") spilled into the extracted reference list. The recoverable content: the K-Score judge receives the task prompt, the generated image (Image 1), and a ground-truth reference image (Image 2); it is instructed that this is not a pixel-level similarity task, extracts the prompt's top hard constraints, and scores faithfulness, visual_correctness, text_accuracy (with a text_accuracy_na flag and a fixed 0.5 score when no readable text is required), and aesthetics, each with a rationale. A second spilled prompt describes judging the agent's answer, which contains a gen_prompt and a list of reference_images selected from search to guide generation.