pith. machine review for the scientific record.

arxiv: 2605.12497 · v1 · submitted 2026-05-12 · 💻 cs.CV

From Web to Pixels: Bringing Agentic Search into Visual Perception

Bokang Yang, Dongming Wu, Kaituo Feng, Xiangyu Yue, Xingping Dong, Xinyi Sun

Pith reviewed 2026-05-13 05:41 UTC · model grok-4.3

classification 💻 cs.CV
keywords visual grounding · image segmentation · visual question answering · agentic search · web search · object detection · benchmark dataset · open-world perception

The pith

Pixel-Searcher resolves hidden object identities from external sources before localizing them at the pixel level.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Most visual perception assumes the needed facts sit inside the given image or frozen model weights. This work examines the open case where a visible target must first be identified through external facts, recent events, long-tail entities, or multi-hop relations. It formalizes the setting as Perception Deep Research and releases WebEye, a benchmark of 120 images with 473 annotated instances, 645 QA pairs, and three task views that demand verifiable evidence. Pixel-Searcher, an agentic search-to-pixel pipeline, is introduced to acquire evidence, resolve identities, and bind them to boxes, masks, or answers. Experiments show it leads all open-source baselines on the three views, with remaining errors concentrated in evidence acquisition, identity resolution, and visual binding.
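
To make the workflow concrete, here is a minimal, self-contained sketch of a search-to-pixel loop in the spirit of the pipeline described above. Every function is a placeholder stub and no name is taken from the paper's code; it only illustrates the three stages of evidence acquisition, identity resolution, and visual binding.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    query: str
    snippet: str


def web_search(query: str) -> str:
    # Stand-in for a real search tool (the paper's prompts mention DuckDuckGo
    # queries); here it only echoes the query.
    return f"search results for: {query}"


def acquire_evidence(question: str, max_rounds: int = 5) -> list[Evidence]:
    """Stage 1: round-bounded evidence acquisition (stubbed)."""
    evidence: list[Evidence] = []
    for t in range(max_rounds):
        query = f"{question} (round {t + 1})"
        evidence.append(Evidence(query, web_search(query)))
        if t >= 1:  # placeholder stopping rule; a real agent checks sufficiency
            break
    return evidence


def resolve_identity(evidence: list[Evidence]) -> str:
    """Stage 2: collapse the evidence into a concrete entity name (stubbed)."""
    return "resolved entity"


def bind_to_pixels(entity: str, task: str) -> dict:
    """Stage 3: bind the resolved entity to a box, mask, or grounded answer."""
    if task == "grounding":
        return {"entity": entity, "bbox": [0, 0, 0, 0]}  # detector call goes here
    if task == "segmentation":
        return {"entity": entity, "mask": None}          # segmenter call goes here
    return {"entity": entity, "answer": "stub"}          # Search-based VQA path


if __name__ == "__main__":
    ev = acquire_evidence("identify the visible target behind this query")
    print(bind_to_pixels(resolve_identity(ev), "grounding"))
```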

Core claim

The paper establishes Perception Deep Research as the setting where visible objects require external resolution before localization, presents the WebEye benchmark with verifiable evidence and annotations, and demonstrates that the Pixel-Searcher agentic workflow outperforms other open-source methods on search-based grounding, segmentation, and VQA by resolving identities and binding them visually.
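
For readers picturing what an object-anchored, evidence-backed sample could look like, a hypothetical record layout is sketched below. The field names are illustrative guesses, not the released WebEye schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AnnotatedInstance:
    # Precise pixel-level labels for the anchored object.
    bbox: tuple              # (x1, y1, x2, y2) box label
    mask_rle: Optional[str]  # encoded segmentation mask, if the view needs one


@dataclass
class WebEyeSample:
    image_path: str
    instance: AnnotatedInstance   # the anchored target object
    query: str                    # knowledge-intensive query
    evidence_chain: list          # verifiable sources supporting the identity
    task_view: str                # "grounding", "segmentation", or "vqa"
    answer_options: list = field(default_factory=list)  # only used by the VQA view
```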

What carries the argument

Pixel-Searcher, the agentic search-to-pixel workflow that resolves hidden target identities from web sources and binds them to boxes, masks, or grounded answers.

If this is right

  • Pixel-Searcher achieves the strongest open-source performance across Search-based Grounding, Search-based Segmentation, and Search-based VQA.
  • Remaining failures concentrate in evidence acquisition, identity resolution, and visual instance binding.
  • The benchmark supplies verifiable evidence chains, precise box and mask labels, and knowledge-intensive queries across 120 images and 473 instances.
  • Three distinct task views are defined that all require external resolution prior to pixel-level output.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Tighter coupling between search modules and visual grounding could reduce the observed binding errors.
  • The same workflow pattern may extend to video sequences or real-time camera feeds where targets change with current events.
  • Long-tail and multi-hop identity resolution remains a bottleneck that future agent designs must target explicitly.

Load-bearing premise

The decisive evidence for identifying a target is not already in the image or model knowledge but must be resolved from external facts, recent events, long-tail entities, or multi-hop relations before localization is possible.

What would settle it

A non-agentic model that reaches comparable accuracy on all three WebEye tasks without external evidence acquisition would falsify the claim that external resolution is required.
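
That falsification test is naturally phrased as an ablation: run the same model on the three views with the search tool disabled and compare scores. The harness below is hypothetical; the evaluation calls are stubs and the margin is an arbitrary illustrative value.

```python
def evaluate(model, task_view: str, use_search: bool) -> float:
    # Stub: a real harness would run the benchmark split for this view and
    # return an accuracy-like score.
    return 0.0


def external_search_is_load_bearing(model,
                                    views=("grounding", "segmentation", "vqa"),
                                    margin: float = 0.05) -> bool:
    # The load-bearing premise survives only if at least one view degrades by
    # more than `margin` once external evidence acquisition is disabled.
    for view in views:
        gap = evaluate(model, view, use_search=True) - evaluate(model, view, use_search=False)
        if gap > margin:
            return True
    return False
```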

Figures

Figures reproduced from arXiv: 2605.12497 by Bokang Yang, Dongming Wu, Kaituo Feng, Xiangyu Yue, Xingping Dong, Xinyi Sun.

Figure 1: Our Perception Deep Research extends grounded perception from visual-cue reference and internal-knowledge reasoning to web-knowledge search.
Figure 2: Overview of WebEyes generation and Pixel-Searcher inference.
Figure 3: Examples of WebEyes task views: Search-based Segmentation outputs a mask, Search-based Grounding outputs a grounded region, and Search-based VQA selects the correct description for a highlighted target.
Figure 4: WebEyes category distribution.
Figure 5: Automated WebEyes construction pipeline. The workflow annotates objects, links entities, searches evidence, generates questions, and filters shortcuts.
Figure 6: Pixel-Searcher overview. Forward tasks resolve the hidden entity and ground it to a box; Search-based VQA matches answer options to the highlighted region. The search loop is bounded by a maximum number of rounds, but the path is adaptive: simple queries may need one factual lookup, while harder queries may require connecting multiple pieces of evidence.
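
Reading the Figure 6 caption together with the agent prompts quoted later on this page, the bounded search loop can be written compactly. The notation below is a reconstruction for exposition, not the paper's exact formulation.

```latex
% Bounded agentic loop, reconstructed from the Figure 6 caption and the
% SEARCH / THINK / ANSWER action format quoted in the appendix prompts.
% At round t <= T the agent conditions on the image I, the query q, and the
% evidence gathered so far, then emits one action.
\[
  a_t \sim \pi\!\left(a \mid I,\, q,\, E_{1:t-1}\right),
  \qquad
  a_t \in \{\textsc{search},\ \textsc{think},\ \textsc{answer}\},
\]
\[
  E_{1:t} \;=\;
  \begin{cases}
    E_{1:t-1} \cup \{e_t\}, & a_t = \textsc{search} \text{ (with } e_t \text{ the retrieved result)},\\[2pt]
    E_{1:t-1}, & \text{otherwise.}
  \end{cases}
\]
% The loop stops when the agent emits ANSWER or t = T; the resolved entity is
% then bound to a box, mask, or grounded answer.
```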
read the original abstract

Visual perception connects high-level semantic understanding to pixel-level perception, but most existing settings assume that the decisive evidence for identifying a target is already in the image or frozen model knowledge. We study a more practical yet harder open-world case where a visible object must first be resolved from external facts, recent events, long-tail entities, or multi-hop relations before it can be localized. We formalize this challenge as Perception Deep Research and introduce WebEye, an object-anchored benchmark with verifiable evidence, knowledge-intensive queries, precise box/mask annotations, and three task views: Search-based Grounding, Search-based Segmentation, and Search-based VQA. WebEyes contains 120 images, 473 annotated object instances, 645 unique QA pairs, and 1,927 task samples. We further propose Pixel-Searcher, an agentic search-to-pixel workflow that resolves hidden target identities and binds them to boxes, masks, or grounded answers. Experiments show that Pixel-Searcher achieves the strongest open-source performance across all three task views, while failures mainly arise from evidence acquisition, identity resolution, and visual instance binding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript introduces Perception Deep Research as a setting for visual perception where identifying a visible target requires resolving external facts, recent events, long-tail entities, or multi-hop relations via web search before localization or answering is possible. It presents the WebEye benchmark (120 images, 473 annotated object instances, 645 unique QA pairs, 1,927 task samples) with three task views—Search-based Grounding, Search-based Segmentation, and Search-based VQA—along with verifiable evidence and precise box/mask annotations. The authors propose Pixel-Searcher, an agentic search-to-pixel workflow, and report that it achieves the strongest open-source performance across the three views, with primary failure modes in evidence acquisition, identity resolution, and visual instance binding.

Significance. If the reported results hold, the work provides a concrete bridge between web-scale search and pixel-level visual tasks, filling a gap left by models that assume all necessary evidence is already present in the image or frozen parameters. The new benchmark supplies a reproducible, object-anchored testbed with knowledge-intensive queries and fine-grained annotations, while the failure categorization offers actionable diagnostics for future agentic systems. These elements constitute a useful resource and methodological template for multimodal research that must handle open-world, externally grounded perception.

minor comments (3)
  1. Abstract: the benchmark is introduced as 'WebEye' but then referred to as 'WebEyes contains 120 images...'; standardize the name throughout for consistency.
  2. Abstract and Experiments: while the full manuscript supplies direct comparisons on the 1,927 samples, the abstract itself contains no quantitative metrics, baselines, or error bars; adding a compact results table or key numbers would strengthen the headline claim for readers who stop at the abstract.
  3. Benchmark description: the three task views are listed but lack a short illustrative example for each (e.g., one query per view); adding one concrete query per view would improve immediate clarity without lengthening the paper.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, recognition of the work's significance in bridging web-scale search with pixel-level perception, and recommendation for minor revision. We are pleased that the WebEye benchmark and Pixel-Searcher are viewed as a useful resource and methodological template.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The manuscript introduces the WebEye benchmark (120 images, 473 instances, 1,927 samples) and the Pixel-Searcher agentic workflow, then reports direct experimental comparisons across Search-based Grounding, Segmentation, and VQA. No equations, derivations, or fitted parameters appear in the provided text. Claims rest on new benchmark construction and measured performance rather than any self-referential reduction of outputs to inputs. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling are present. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the domain assumption that external search is necessary for identity resolution in open-world perception; no free parameters are described, and the new benchmark and agent are introduced without additional invented physical entities.

axioms (1)
  • domain assumption: A visible object must first be resolved from external facts, recent events, long-tail entities, or multi-hop relations before it can be localized in the image.
    This premise defines the Perception Deep Research setting and the three task views in the abstract.
invented entities (2)
  • WebEye benchmark (no independent evidence)
    purpose: Object-anchored evaluation set with verifiable evidence and precise annotations for search-based visual tasks
    Newly introduced dataset with stated size and structure.
  • Pixel-Searcher (no independent evidence)
    purpose: Agentic search-to-pixel workflow that resolves hidden target identities and binds them to boxes, masks, or grounded answers
    Newly proposed system whose performance is claimed in the abstract.

pith-pipeline@v0.9.0 · 5506 in / 1508 out tokens · 154344 ms · 2026-05-13T05:41:10.388902+00:00 · methodology

Reference graph

Works this paper leans on

113 extracted references · 113 canonical work pages · 12 internal anchors

  1. [1]

    Video-R1: Reinforcing Video Reasoning in MLLMs

    Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in mllms.arXiv preprint arXiv:2503.21776, 2025

  2. [2]

    VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

    Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, et al. Vlm-r1: A stable and generalizable r1-style large vision-language model.arXiv preprint arXiv:2504.07615, 2025

  3. [3]

    The Latent Space: Foundation, Evolution, Mechanism, Ability, and Outlook

    Xinlei Yu, Zhangquan Chen, Yongbo He, Tianyu Fu, Cheng Yang, Chengming Xu, Yue Ma, Xiaobin Hu, Zhe Cao, Jie Xu, et al. The latent space: Foundation, evolution, mechanism, ability, and outlook.arXiv preprint arXiv:2604.02029, 2026

  4. [4]

    Perception-R1: Pioneering Perception Policy with Reinforcement Learning

    En Yu, Kangheng Lin, Liang Zhao, Yana Wei, Yuang Peng, Haoran Wei, Jianjian Sun, Chunrui Han, Zheng Ge, Xiangyu Zhang, et al. Perception-r1: Pioneering perception policy with reinforcement learning.Advances in Neural Information Processing Systems, 38:94827–94853, 2026

  5. [5]

    Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing

    Junfei Wu, Jian Guan, Kaituo Feng, Qiang Liu, Shu Wu, Liang Wang, Wei Wu, and Tieniu Tan. Reinforcing spatial reasoning in vision-language models with interwoven thinking and visual drawing.Advances in Neural Information Processing Systems, 38:143297–143330, 2026

  6. [6]

    SophiaVL-R1: Reinforcing MLLMs Reasoning with Thinking Reward

    Kaixuan Fan, Kaituo Feng, Haoming Lyu, Dongzhan Zhou, and Xiangyu Yue. Sophiavl-r1: Reinforcing mllms reasoning with thinking reward.arXiv preprint arXiv:2505.17018, 2025

  7. [7]

    Mdetr- modulated detection for end-to-end multi-modal understanding

    Aishwarya Kamath, Mannat Singh, Yann LeCun, Gabriel Synnaeve, Ishan Misra, and Nicolas Carion. Mdetr- modulated detection for end-to-end multi-modal understanding. InProceedings of the IEEE/CVF international conference on computer vision, pages 1780–1790, 2021

  8. [8]

    Grounding dino: Marrying dino with grounded pre-training for open-set object detection

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In European conference on computer vision, pages 38–55. Springer, 2024

  9. [9]

    Lisa: Reasoning segmentation via large language model

    Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9579–9589, 2024

  10. [10]

    Glamm: Pixel grounding large multimodal model

    Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M Anwer, Eric Xing, Ming-Hsuan Yang, and Fahad S Khan. Glamm: Pixel grounding large multimodal model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13009–13018, 2024

  11. [11]

    FVQA: Fact-Based Visual Question Answering

    Peng Wang, Qi Wu, Chunhua Shen, Anthony Dick, and Anton Van Den Hengel. Fvqa: Fact-based visual question answering.IEEE transactions on pattern analysis and machine intelligence, 40(10):2413–2427, 2017

  12. [12]

    Ok-vqa: A visual question answering benchmark requiring external knowledge

    Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. InProceedings of the IEEE/cvf conference on computer vision and pattern recognition, pages 3195–3204, 2019

  13. [13]

    MMSearch: Benchmarking the Potential of Large Models as Multi-Modal Search Engines

    Dongzhi Jiang, Renrui Zhang, Ziyu Guo, Yanmin Wu, Jiayi Lei, Pengshuo Qiu, Pan Lu, Zehui Chen, Chaoyou Fu, Guanglu Song, et al. Mmsearch: Benchmarking the potential of large models as multi-modal search engines. arXiv preprint arXiv:2409.12959, 2024

  14. [14]

    MMSearch-R1: Incentivizing LMMs to Search

    Jinming Wu, Zihao Deng, Wei Li, Yiding Liu, Bo You, Bo Li, Zejun Ma, and Ziwei Liu. Mmsearch-r1: Incentivizing lmms to search.arXiv preprint arXiv:2506.20670, 2025

  15. [15]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

  16. [16]

    Exploring Reasoning Reward Model for Agents

    Kaixuan Fan, Kaituo Feng, Manyuan Zhang, Tianshuo Peng, Zhixun Li, Yilei Jiang, Shuang Chen, Peng Pei, Xunliang Cai, and Xiangyu Yue. Exploring reasoning reward model for agents.arXiv preprint arXiv:2601.22154, 2026

  17. [17]

    OpenAI GPT-5 System Card

    Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025

  18. [18]

    A-okvqa: A benchmark for visual question answering using world knowledge

    Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-okvqa: A benchmark for visual question answering using world knowledge. InEuropean conference on computer vision, pages 146–162. Springer, 2022

  19. [19]

    WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent

    Xinyu Geng, Peng Xia, Zhen Zhang, Xinyu Wang, Qiuchen Wang, Ruixue Ding, Chenxi Wang, Jialong Wu, Yida Zhao, Kuan Li, et al. Webwatcher: Breaking new frontier of vision-language deep research agent.arXiv preprint arXiv:2508.05748, 2025

  20. [20]

    Modeling context in referring expressions

    Licheng Yu, Patrick Poirson, Shan Yang, Alexander C Berg, and Tamara L Berg. Modeling context in referring expressions. InEuropean conference on computer vision, pages 69–85. Springer, 2016

  21. [21]

    Segmentation from natural language expressions

    Ronghang Hu, Marcus Rohrbach, and Trevor Darrell. Segmentation from natural language expressions. In European conference on computer vision, pages 108–124. Springer, 2016

  22. [22]

    Onlinerefer: A simple online baseline for referring video object segmentation

    Dongming Wu, Tiancai Wang, Yuang Zhang, Xiangyu Zhang, and Jianbing Shen. Onlinerefer: A simple online baseline for referring video object segmentation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2761–2770, 2023

  23. [23]

    Lavt: Language-aware vision transformer for referring image segmentation

    Zhao Yang, Jiaqi Wang, Yansong Tang, Kai Chen, Hengshuang Zhao, and Philip HS Torr. Lavt: Language-aware vision transformer for referring image segmentation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18155–18165, 2022

  24. [24]

    Cris: Clip-driven referring image segmentation

    Zhaoqing Wang, Yu Lu, Qiang Li, Xunqiang Tao, Yandong Guo, Mingming Gong, and Tongliang Liu. Cris: Clip-driven referring image segmentation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11686–11695, 2022

  25. [25]

    Prompt-driven referring image segmentation with instance contrasting

    Chao Shang, Zichen Song, Heqian Qiu, Lanxiao Wang, Fanman Meng, and Hongliang Li. Prompt-driven referring image segmentation with instance contrasting. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4124–4134, 2024

  26. [26]

    AdaTooler-V: Adaptive Tool-Use for Images and Videos

    Chaoyang Wang, Kaituo Feng, Dongyang Chen, Zhongyu Wang, Zhixun Li, Sicheng Gao, Meng Meng, Xu Zhou, Manyuan Zhang, Yuzhang Shang, et al. Adatooler-v: Adaptive tool-use for images and videos.arXiv preprint arXiv:2512.16918, 2025

  27. [27]

    Referdino: Referring video object segmentation with visual grounding foundations

    Tianming Liang, Kun-Yu Lin, Chaolei Tan, Jianguo Zhang, Wei-Shi Zheng, and Jian-Fang Hu. Referdino: Referring video object segmentation with visual grounding foundations. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 20009–20019, 2025

  28. [28]

    Ragnet: Large-scale reasoning-based affordance segmentation benchmark towards general grasping

    Dongming Wu, Yanping Fu, Saike Huang, Yingfei Liu, Fan Jia, Nian Liu, Feng Dai, Tiancai Wang, Rao Muhammad Anwer, Fahad Shahbaz Khan, et al. Ragnet: Large-scale reasoning-based affordance segmentation benchmark towards general grasping. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 11980–11990, 2025

  29. [29]

    Gen-Searcher: Reinforcing Agentic Search for Image Generation

    Kaituo Feng, Manyuan Zhang, Shuang Chen, Yunlong Lin, Kaixuan Fan, Yilei Jiang, Hongyu Li, Dian Zheng, Chenyang Wang, and Xiangyu Yue. Gen-searcher: Reinforcing agentic search for image generation.arXiv preprint arXiv:2603.28767, 2026

  30. [30]

    Retrieval augmented visual question answering with outside knowledge

    Weizhe Lin and Bill Byrne. Retrieval augmented visual question answering with outside knowledge. InProceedings of the 2022 conference on empirical methods in natural language processing, pages 11238–11254, 2022

  31. [31]

    MM-DeepResearch: A Simple and Effective Multimodal Agentic Search Baseline

    Huanjin Yao, Qixiang Yin, Min Yang, Ziwang Zhao, Yibo Wang, Haotian Luo, Jingyi Zhang, and Jiaxing Huang. Mm-deepresearch: A simple and effective multimodal agentic search baseline.arXiv preprint arXiv:2603.01050, 2026

  32. [32]

    OpenSearch-VL: An Open Recipe for Frontier Multimodal Search Agents

    Shuang Chen, Kaituo Feng, Hangting Chen, Wenxuan Huang, Dasen Dai, Quanxin Shou, Yunlong Lin, Xiangyu Yue, Shenghua Gao, and Tianyu Pang. Opensearch-vl: An open recipe for frontier multimodal search agents. arXiv preprint arXiv:2605.05185, 2026

  33. [33]

    ROSE: Retrieval-Oriented Segmentation Enhancement

    Song Tang, Guangquan Jie, Henghui Ding, and Yu-Gang Jiang. Rose: Retrieval-oriented segmentation enhancement.arXiv preprint arXiv:2604.14147, 2026

  34. [34]

    Seg-Research: Segmentation with Interleaved Reasoning and External Search

    Tianming Liang, Qirui Du, Jian-Fang Hu, Haichao Jiang, Zicheng Lin, and Wei-Shi Zheng. Seg-research: Segmentation with interleaved reasoning and external search.arXiv preprint arXiv:2602.04454, 2026

  35. [35]

    Segment anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. InProceedings of the IEEE/CVF international conference on computer vision, pages 4015–4026, 2023

  36. [36]

    Qwen3 technical report, 2025

    Qwen Team. Qwen3 technical report, 2025

  37. [37]

    UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning

    Sule Bai, Mingxing Li, Yong Liu, Jing Tang, Haoji Zhang, Lei Sun, Xiangxiang Chu, and Yansong Tang. Univg-r1: Reasoning guided universal visual grounding with reinforcement learning.arXiv preprint arXiv:2505.14231, 2025

  38. [38]

    Ground-R1: Incentivizing Grounded Visual Reasoning via Reinforcement Learning

    Meng Cao, Haoze Zhao, Can Zhang, Xiaojun Chang, Ian Reid, and Xiaodan Liang. Ground-r1: Incentivizing grounded visual reasoning via reinforcement learning.arXiv preprint arXiv:2505.20272, 2025

  39. [39]

    OneThinker: All-in-one Reasoning Model for Image and Video

    Kaituo Feng, Manyuan Zhang, Hongyu Li, Kaixuan Fan, Shuang Chen, Yilei Jiang, Dian Zheng, Peiwen Sun, Yiyuan Zhang, Haoze Sun, et al. Onethinker: All-in-one reasoning model for image and video.arXiv preprint arXiv:2512.03043, 2025

  40. [40]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025

  41. [41]

    Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement

    Yuqi Liu, Bohao Peng, Zhisheng Zhong, Zihao Yue, Fanbin Lu, Bei Yu, and Jiaya Jia. Seg-zero: Reasoning-chain guided segmentation via cognitive reinforcement.arXiv preprint arXiv:2503.06520, 2025

  42. [42]

    Seg-R1: Segmentation Can Be Surprisingly Simple with Reinforcement Learning

    Zuyao You and Zuxuan Wu. Seg-r1: Segmentation can be surprisingly simple with reinforcement learning.arXiv preprint arXiv:2506.22624, 2025

  43. [43]

    Affordance-r1: Reinforcement learning for generalizable affordance reasoning in multimodal large language models

    Hanqing Wang, Shaoyang Wang, Yiming Zhong, Zemin Yang, Jiamin Wang, Zhiqing Cui, Jiahao Yuan, Yifan Han, Mingyu Liu, and Yuexin Ma. Affordance-r1: Reinforcement learning for generalizable affordance reasoning in multimodal large language models. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 9738–9746, 2026

  44. [44]

    Sam 3: Segment anything with concepts, 2025

    Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, Jie Lei, Tengyu Ma, Baishan Guo, Arpit Kalla, Markus Marks, Joseph Greer, Meng Wang, Peize Sun, Roman Rädle, Triantafyllos Afouras, Effrosyni Mavroudi, Katherine Xu, Tsung-Han Wu, Yu Zhou, Liliane ...

  45. [45]

    VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning

    Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, and Wenhu Chen. Vl-rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning.arXiv preprint arXiv:2504.08837, 2025

  46. [46]

    Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

    Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Xu Tang, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models.arXiv preprint arXiv:2503.06749, 2025

  47. [47]

    1-3 sub-questions, ordered by reasoning dependency

  48. [48]

    Each sub-question should target one hop of reasoning

  49. [49]

    If the question is already simple, return it as the only sub-question

  50. [50]

    If the question asks about the item/person in the image, the last sub-question must still ask about that final target, not about an intermediate clue

    Preserve the final target of the original question. If the question asks about the item/person in the image, the last sub-question must still ask about that final target, not about an intermediate clue

  51. [51]

    Intermediate clues are for resolving the target, not for becoming the target

    Do not let a year, event, or historical clue replace the final grounded entity. Intermediate clues are for resolving the target, not for becoming the target

  52. [52]

    Multi-round search agent

    Return only JSON. Multi-round search agent. You are a multi-round reasoning agent for visual grounding. Your goal is to identify the exact entity described by the question so it can be located in an image. Original question: {question} Sub-questions: {sub_questions} Accumulated evidence so far: {evidence} Interaction round {round_num} of {max_rounds}. Onl...

  53. [53]

    action":

    {"action": "SEARCH", "query": "your DuckDuckGo search query"}

  54. [54]

    action":

    {"action": "THINK", "reasoning": "your reasoning based on evidence so far"}

  55. [55]

    action":

    {"action": "ANSWER", "entity_name": "resolved entity", "visual_category": "phone/person/car/...", "entity_type": "device /person/character/vehicle/object", "key_cues": ["cue1", "cue2"], "confidence": 0.0-1.0} Guidelines: - Use SEARCH to gather information you don’t have yet. - Use THINK only to briefly consolidate evidence before the next action. - If evi...

  56. [56]

    Answer the actual item/person that should be located in the image

  57. [57]

    Do not answer with an intermediate clue entity, historical reference, designer, event, or source article unless that is also the visible target

  58. [58]

    Prefer the concrete visible model/person/character over a generic series or franchise name

  59. [59]

    Only return an exact model/person if the evidence explicitly supports that exact target; otherwise return the best supported visible target

  60. [60]

    Entity verification

    Return only JSON. Entity verification. You are checking whether a resolved entity is actually consistent with a visual grounding question and the gathered evidence. Question: {question} Proposed entity: {entity_name} Visual category: {visual_category} Entity type: {entity_type} Key cues: {key_cues} Evidence: {evidence} Return strict JSON: {"is_consistent"...

  61. [61]

    Mark is_consistent false if the proposed entity seems to be the wrong product/person/character/model, too generic, unsupported by evidence, or an intermediate clue rather than the final visible target in the image

  62. [62]

    consistency_score 5 means the entity is well supported and specific

  63. [63]

    If inconsistent, provide 1-2 targeted followup_queries to resolve the remaining ambiguity

  64. [64]

    Do not mark an entity consistent unless the evidence explicitly supports that exact model/person, not just a nearby series, sibling model, platform, or speculative variant

    For model-level answers, exact evidence matters. Do not mark an entity consistent unless the evidence explicitly supports that exact model/person, not just a nearby series, sibling model, platform, or speculative variant

  65. [65]

    Entity repair

    Return only JSON. Entity repair. The current resolved entity for a visual grounding question appears unreliable. Question: {question} Current entity: {entity_name} Known issues with the current entity: {issues} Evidence: {evidence} Return strict JSON: {"entity_name": "better entity", "visual_category": "phone/person/car/object/...", "entity_type": "device...

  66. [66]

    Do not stick to the current entity if it is unsupported

    Re-resolve the entity from the evidence. Do not stick to the current entity if it is unsupported

  67. [67]

    Prefer the most concrete model/person/character/entity actually supported by the evidence

  68. [68]

    If the question asks about the item/person in the image, answer that final visible target, not an intermediate clue used to identify it

  69. [69]

    Prefer an exact model/person only when it is explicitly supported by the evidence; otherwise step back to the best supported visible target

  70. [70]

    If evidence is insufficient, still return the best alternative guess

  71. [71]

    Visual appearance extraction

    Return only JSON. B.2 Visual Grounding Visual appearance extraction. Given search results about the appearance of "{entity_name}", extract a concise visual description focusing on shape, color , size, logos, and distinguishing physical features. Search results: {search_evidence} Return strict JSON: {"visual_description": "1-3 sentence description of how i...

  72. [72]

    bbox must use absolute pixel coordinates in the FIRST image only

  73. [73]

    Use the attached reference images to find the same object/person/icon/model

  74. [74]

    If several similar instances exist, choose the one best matching the cues

  75. [75]

    Avoid broad boxes that cover multiple objects, large empty regions , or the center area between objects

    Return a tight box around one concrete instance only. Avoid broad boxes that cover multiple objects, large empty regions , or the center area between objects

  76. [76]

    If no plausible match exists, return bbox null

  77. [77]

    Candidate joint ranking

    Return only JSON. Candidate joint ranking. You are selecting the best matching candidate in the FIRST image. The FIRST image is the full scene with all candidate boxes labeled. The next candidate images are crops in this order: {candidate_order} Any remaining images are web reference images for the target entity. Question: {reference_text} Entity name: {e...

  78. [78]

    Compare the labeled boxes in the overview image and the candidate crops jointly

  79. [79]

    Prefer exact instance-level matches, not just same coarse category

  80. [80]

    Use reference images when available

Showing first 80 references.