Gen-Searcher: Reinforcing Agentic Search for Image Generation
Pith reviewed 2026-05-14 21:14 UTC · model grok-4.3
The pith
Gen-Searcher trains image generators to run multi-hop searches for external knowledge and reference images before synthesis.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Gen-Searcher performs multi-hop reasoning and search to collect the textual knowledge and reference images needed for grounded generation. It is trained first with supervised fine-tuning on Gen-Searcher-SFT-10k and then with agentic reinforcement learning on Gen-Searcher-RL-6k using dual text-based and image-based rewards, yielding substantial improvements on knowledge-intensive image generation tasks.
What carries the argument
Agentic search loop that interleaves reasoning, tool calls for text and image retrieval, and conditioned generation, optimized end-to-end by GRPO with dual rewards.
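The interleaved loop described above can be sketched as follows. All function names, the tool interface, and the stopping policy here are illustrative stand-ins, since the paper's actual implementation is not reproduced in this review.

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # "search_text" | "search_image" | "generate"
    query: str = ""

# Toy stand-ins for the policy model and its tools (hypothetical interfaces).
def reason_step(prompt, knowledge, references):
    """Policy decides the next action from the evidence gathered so far."""
    if not knowledge:
        return Action("search_text", f"facts about: {prompt}")
    if not references:
        return Action("search_image", f"reference photo: {prompt}")
    return Action("generate")

def search_text(query):
    return f"<retrieved passage for '{query}'>"

def search_images(query):
    return f"<reference image for '{query}'>"

def generate_image(prompt, knowledge, references):
    # Final synthesis is conditioned on both modalities of retrieved evidence.
    return {"prompt": prompt, "knowledge": knowledge, "references": references}

def agentic_generate(prompt, max_hops=4):
    """Interleave reasoning with text/image retrieval, then generate."""
    knowledge, references = [], []
    for _ in range(max_hops):
        action = reason_step(prompt, knowledge, references)
        if action.kind == "search_text":
            knowledge.append(search_text(action.query))
        elif action.kind == "search_image":
            references.append(search_images(action.query))
        else:  # "generate": the policy judges the evidence sufficient
            break
    return generate_image(prompt, knowledge, references)

result = agentic_generate("a knowledge-intensive prompt")
```

In the actual system this loop is what GRPO optimizes end-to-end: the dual rewards score the final image and the collected text, and the policy learns when to keep searching versus when to generate.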
If this is right
- Image generators can be extended beyond their training cutoff by retrieving fresh external information at inference time.
- Dual text and image rewards supply more stable signals for training search-augmented agents than single-modality rewards.
- New benchmarks that explicitly test search-grounded generation, such as KnowGen, become necessary to measure progress on knowledge-intensive prompts.
- Open release of the SFT and RL datasets enables further work on search agents for vision tasks.
Where Pith is reading between the lines
- The same search-before-generate loop could be applied to video or 3D generation where up-to-date references are equally scarce.
- If search quality improves, the method may reduce factual hallucinations in generated images more effectively than prompt engineering alone.
- Real-time web search integration would turn the agent into a live knowledge system rather than one limited to static corpora.
Load-bearing premise
Multi-hop search will reliably return accurate, relevant textual knowledge and reference images that improve generation without injecting new errors or noise.
What would settle it
Running the same prompts through the base model and through Gen-Searcher: if the search-augmented outputs showed more factual mistakes or lower visual fidelity on KnowGen, the load-bearing premise would fail.
Original abstract
Recent image generation models have shown strong capabilities in generating high-fidelity and photorealistic images. However, they are fundamentally constrained by frozen internal knowledge, thus often failing on real-world scenarios that are knowledge-intensive or require up-to-date information. In this paper, we present Gen-Searcher, as the first attempt to train a search-augmented image generation agent, which performs multi-hop reasoning and search to collect the textual knowledge and reference images needed for grounded generation. To achieve this, we construct a tailored data pipeline and curate two high-quality datasets, Gen-Searcher-SFT-10k and Gen-Searcher-RL-6k, containing diverse search-intensive prompts and corresponding ground-truth synthesis images. We further introduce KnowGen, a comprehensive benchmark that explicitly requires search-grounded external knowledge for image generation and evaluates models from multiple dimensions. Based on these resources, we train Gen-Searcher with SFT followed by agentic reinforcement learning with dual reward feedback, which combines text-based and image-based rewards to provide more stable and informative learning signals for GRPO training. Experiments show that Gen-Searcher brings substantial gains, improving Qwen-Image by around 16 points on KnowGen and 15 points on WISE. We hope this work can serve as an open foundation for search agents in image generation, and we fully open-source our data, models, and code.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Gen-Searcher as the first search-augmented image generation agent that performs multi-hop reasoning and search to collect textual knowledge and reference images for grounded generation. It constructs two datasets (Gen-Searcher-SFT-10k and Gen-Searcher-RL-6k), introduces the KnowGen benchmark, and trains via SFT followed by GRPO with dual text- and image-based rewards, claiming gains of ~16 points on KnowGen and ~15 points on WISE over Qwen-Image.
Significance. If the gains are shown to be robust and attributable to the agentic search rather than prompt engineering or noisy retrieval, the work would be significant as an open foundation for search-augmented image generation, providing datasets, models, and code that address the limitation of frozen knowledge in current generators.
Major comments (3)
- [Data pipeline and dataset construction sections] The central claim of substantial gains rests on the assumption that multi-hop search returns accurate knowledge and references, yet the manuscript provides no quantitative audit (precision, recall, or error rate) of the retrieval pipeline used to build Gen-Searcher-SFT-10k and Gen-Searcher-RL-6k; without this, it is impossible to rule out that observed improvements reflect benchmark-specific noise rather than reliable grounding.
- [Experiments section] Experiments report numerical gains on KnowGen and WISE but supply no details on baselines, number of evaluation runs, error bars, ablation studies isolating the contribution of dual-reward GRPO versus SFT alone, or failure cases; this absence makes the ~16-point and ~15-point improvements unverifiable from the provided evidence.
- [Training methodology (GRPO with dual rewards)] The dual-reward formulation (text-based + image-based) under GRPO is presented as providing stable signals, but no analysis is given of how the rewards interact when search returns mismatched or outdated references, leaving open whether the RL stage actually corrects retrieval noise.
Minor comments (2)
- [Abstract] The abstract states gains of 'around 16 points' and '15 points'; exact per-metric scores, standard deviations, and comparison tables should be added for reproducibility.
- [Method section] Notation for the dual reward function and GRPO objective should be defined explicitly with equations rather than prose descriptions.
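One explicit form such a definition could take, written here purely to illustrate the requested notation (the paper's exact reward weighting and objective may differ): combine the two rewards per rollout with a mixing weight, normalize within the rollout group, and apply the standard clipped GRPO objective.

```latex
% Illustrative dual-reward GRPO notation (assumed form, not the paper's exact one).
% Combined reward for rollout i, with mixing weight \lambda \in [0,1]:
r_i = \lambda\, r_i^{\text{text}} + (1-\lambda)\, r_i^{\text{img}}
% Group-normalized advantage over G rollouts per prompt q:
\hat{A}_i = \frac{r_i - \mathrm{mean}(\{r_j\}_{j=1}^{G})}{\mathrm{std}(\{r_j\}_{j=1}^{G})}
% Clipped objective with importance ratio
% \rho_i(\theta) = \pi_\theta(o_i \mid q) / \pi_{\theta_{\text{old}}}(o_i \mid q):
\mathcal{J}(\theta) =
  \mathbb{E}\!\left[ \frac{1}{G} \sum_{i=1}^{G}
    \min\!\big( \rho_i(\theta)\, \hat{A}_i,\;
      \mathrm{clip}(\rho_i(\theta),\, 1-\epsilon,\, 1+\epsilon)\, \hat{A}_i \big)
  \right]
  - \beta\, \mathbb{D}_{\mathrm{KL}}\!\big[ \pi_\theta \,\|\, \pi_{\mathrm{ref}} \big]
```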
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below. Where the concerns identify gaps in the current manuscript, we commit to revisions that strengthen the evidence without altering the core claims or methodology.
Point-by-point responses
-
Referee: [Data pipeline and dataset construction sections] The central claim of substantial gains rests on the assumption that multi-hop search returns accurate knowledge and references, yet the manuscript provides no quantitative audit (precision, recall, or error rate) of the retrieval pipeline used to build Gen-Searcher-SFT-10k and Gen-Searcher-RL-6k; without this, it is impossible to rule out that observed improvements reflect benchmark-specific noise rather than reliable grounding.
Authors: We agree that an explicit quantitative audit of the retrieval pipeline would strengthen the grounding claims. In the revised manuscript we will add a dedicated subsection under Data Pipeline that reports precision and recall (computed via manual annotation of a 500-example stratified sample) together with error-rate statistics for both textual knowledge and reference-image retrieval. These metrics will be broken down by hop depth and query type. The downstream gains on KnowGen (a benchmark explicitly constructed to require external knowledge) already provide indirect validation, but the new audit will directly address the concern that improvements could stem from benchmark-specific noise. revision: yes
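The audit the authors commit to could be tabulated along these lines; the annotation schema below is hypothetical, invented here to make the precision/recall-by-hop-depth computation concrete.

```python
from collections import defaultdict

# Hypothetical annotation records from the proposed 500-example manual audit:
# for each audited query, annotators count retrieved items that were relevant,
# total items retrieved, and relevant items the pipeline missed, by hop depth.
annotations = [
    {"hop": 1, "relevant_retrieved": 1, "total_retrieved": 1, "relevant_missed": 0},
    {"hop": 1, "relevant_retrieved": 0, "total_retrieved": 1, "relevant_missed": 1},
    {"hop": 2, "relevant_retrieved": 1, "total_retrieved": 2, "relevant_missed": 0},
]

def audit_by_hop(records):
    """Precision and recall of the retrieval pipeline, broken down by hop depth."""
    stats = defaultdict(lambda: {"tp": 0, "retrieved": 0, "relevant": 0})
    for r in records:
        s = stats[r["hop"]]
        s["tp"] += r["relevant_retrieved"]
        s["retrieved"] += r["total_retrieved"]
        s["relevant"] += r["relevant_retrieved"] + r["relevant_missed"]
    return {
        hop: {
            "precision": s["tp"] / s["retrieved"] if s["retrieved"] else 0.0,
            "recall": s["tp"] / s["relevant"] if s["relevant"] else 0.0,
        }
        for hop, s in stats.items()
    }

report = audit_by_hop(annotations)
```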
-
Referee: [Experiments section] Experiments report numerical gains on KnowGen and WISE but supply no details on baselines, number of evaluation runs, error bars, ablation studies isolating the contribution of dual-reward GRPO versus SFT alone, or failure cases; this absence makes the ~16-point and ~15-point improvements unverifiable from the provided evidence.
Authors: We acknowledge that the current Experiments section lacks the requested statistical and ablation details. In the revision we will: (i) enumerate all baselines with exact model versions and prompting setups, (ii) report results averaged over three independent evaluation runs with standard-error bars, (iii) add ablations that isolate SFT-only, single-reward GRPO, and dual-reward GRPO, and (iv) include a qualitative failure-case analysis with representative examples. These additions will make the reported gains fully verifiable and will quantify the incremental benefit of the agentic RL stage. revision: yes
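The promised reporting of means with standard-error bars over three runs is straightforward; the benchmark scores below are illustrative placeholders, not numbers from the paper.

```python
import statistics

def mean_with_se(scores):
    """Mean and standard error of the mean across independent evaluation runs."""
    m = statistics.mean(scores)
    se = statistics.stdev(scores) / len(scores) ** 0.5 if len(scores) > 1 else 0.0
    return m, se

# Hypothetical KnowGen scores from three independent evaluation runs.
runs = [71.8, 72.4, 72.1]
mean, se = mean_with_se(runs)
```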
-
Referee: [Training methodology (GRPO with dual rewards)] The dual-reward formulation (text-based + image-based) under GRPO is presented as providing stable signals, but no analysis is given of how the rewards interact when search returns mismatched or outdated references, leaving open whether the RL stage actually corrects retrieval noise.
Authors: We appreciate the referee highlighting the missing interaction analysis. The manuscript currently motivates the dual-reward design for stability but does not examine behavior under noisy retrieval. In the revised version we will insert a new subsection that (a) plots per-step text and image reward trajectories on a held-out noisy-retrieval subset, (b) provides case studies where mismatched references occur, and (c) shows that the combined reward still yields net positive policy updates by down-weighting unreliable text signals in favor of image-based feedback. This analysis will directly demonstrate the RL stage’s capacity to mitigate retrieval noise. revision: yes
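The down-weighting behavior the rebuttal promises to demonstrate can be made concrete with a toy confidence-weighted blend. This weighting scheme is an illustration only; the paper's actual reward combination is not specified in this review.

```python
def combined_reward(r_text, r_img, text_conf, base_weight=0.5):
    """Confidence-weighted blend of text-based and image-based rewards.

    When retrieval looks unreliable (low text_conf), the text reward is
    down-weighted and the image-based reward dominates the update signal.
    """
    w_text = base_weight * text_conf
    w_img = 1.0 - w_text
    return w_text * r_text + w_img * r_img

# Trustworthy retrieval: both signals contribute at the base weighting.
clean = combined_reward(r_text=1.0, r_img=0.8, text_conf=1.0)
# Mismatched or outdated reference: the image reward carries the update.
noisy = combined_reward(r_text=1.0, r_img=0.8, text_conf=0.2)
```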
Circularity Check
No circularity detected in derivation chain
Full rationale
The paper constructs datasets via an external multi-hop search pipeline, introduces the independent KnowGen benchmark, and trains using SFT followed by GRPO with dual text-based and image-based rewards. These elements rely on external search results and reward signals that are not defined in terms of the final benchmark scores. No self-definitional reductions, fitted parameters renamed as predictions, or load-bearing self-citations appear in the described chain. Reported gains on KnowGen and WISE are empirical outcomes rather than forced by construction.
Forward citations
Cited by 5 Pith papers
-
From Web to Pixels: Bringing Agentic Search into Visual Perception
WebEye benchmark and Pixel-Searcher agent enable visual perception tasks by using web search to resolve object identities before precise localization or answering.
-
Flow-OPD: On-Policy Distillation for Flow Matching Models
Flow-OPD applies on-policy distillation to flow matching models via specialized teachers, cold-start initialization, and manifold anchor regularization, lifting GenEval from 63 to 92 and OCR from 59 to 94 on Stable Di...
-
Flow-OPD: On-Policy Distillation for Flow Matching Models
Flow-OPD applies on-policy distillation to flow-matching text-to-image models, lifting GenEval from 63 to 92 and OCR accuracy from 59 to 94 while preserving fidelity.
-
Flow-OPD: On-Policy Distillation for Flow Matching Models
Flow-OPD applies on-policy distillation to flow matching models, achieving GenEval of 92 and OCR accuracy of 94 on Stable Diffusion 3.5 Medium while avoiding the seesaw effect of multi-reward optimization.
-
SCOPE: Structured Decomposition and Conditional Skill Orchestration for Complex Image Generation
SCOPE maintains semantic commitments via structured specifications and conditional skill orchestration, achieving 0.60 EGIP on the new Gen-Arena benchmark while outperforming baselines on WISE-V and MindBench.