pith. machine review for the scientific record.

arXiv:2605.07177 · v2 · submitted 2026-05-08 · 💻 cs.LG · cs.AI

Recognition: 2 Lean theorem links

HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents


Pith reviewed 2026-05-12 03:07 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords multimodal agents · reinforcement learning · parallel tool use · efficiency-aware training · visual grounding · data synthesis pipeline · tool-call reduction · search agents

The pith

HyperEyes trains multimodal agents to search multiple entities concurrently in one round rather than sequentially, delivering 9.9 percent higher accuracy with 5.3 times fewer tool calls.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing multimodal search agents issue one tool call per entity and accumulate extra rounds on queries that break into independent parts. HyperEyes instead fuses visual grounding and retrieval into a single atomic action that dispatches multiple grounded queries at once. It reaches this behavior through a two-stage process that first synthesizes parallel-friendly trajectories and then applies dual-grained reinforcement learning to treat efficiency as a primary objective. The macro-level TRACE reward progressively tightens efficiency targets across entire trajectories, while micro-level on-policy distillation supplies token corrections on failed attempts. A reader would care because most real queries contain independent sub-tasks, so fewer rounds translate directly into faster and less expensive agent operation.
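
To make the round-count arithmetic concrete, here is a minimal sketch of serial versus parallel dispatch, with a hypothetical `search` function standing in for one grounded retrieval call; none of these names come from the paper.

```python
# Minimal sketch of serial vs. parallel dispatch. `search` is a hypothetical
# stand-in for one grounded retrieval call; it is not the paper's API.
from concurrent.futures import ThreadPoolExecutor

def search(query: str) -> str:
    """Pretend network round trip returning retrieval results."""
    return f"results for {query!r}"

def serial_rounds(queries: list[str]) -> list[str]:
    # Conventional agent: one tool call per entity, so len(queries) rounds.
    return [search(q) for q in queries]

def parallel_round(queries: list[str]) -> list[str]:
    # Atomic parallel action: every grounded query dispatched concurrently,
    # so the whole interaction costs a single round.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(search, queries))

print(parallel_round(["red vintage car model", "clock tower in background"]))
```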

Core claim

The paper establishes that a parallel multimodal search agent trained with Dual-Grained Efficiency-Aware Reinforcement Learning can surpass prior open-source agents by 9.9 percent in accuracy while using 5.3 times fewer tool-call rounds on average across six benchmarks. The method first creates cold-start data via a Parallel-Amenable Data Synthesis Pipeline and Progressive Rejection Sampling, then optimizes with a trajectory-level TRACE reward that monotonically tightens its efficiency reference and with on-policy distillation that adds dense token-level signals on failed rollouts.
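
The abstract specifies TRACE's defining property, a reference that only tightens during training, but not its functional form. Below is a minimal sketch of one plausible reading; the linear shaping, the penalty coefficient, and the tightening schedule are invented for illustration and are not the paper's formula.

```python
# Hypothetical reading of a TRACE-style trajectory reward: correct answers
# earn full credit only when tool-call rounds stay at or under a reference
# budget that is monotonically tightened as training progresses.

def trace_reward(correct: bool, rounds_used: int, reference: int) -> float:
    if not correct:
        return 0.0                       # outcome correctness stays primary
    excess = max(0, rounds_used - reference)
    return max(0.0, 1.0 - 0.2 * excess)  # penalize only superfluous rounds

def tighten_reference(reference: int, step: int, every: int = 1000) -> int:
    # Monotone schedule: the reference never loosens, so genuine multi-hop
    # search is squeezed gradually rather than forbidden outright.
    return max(1, reference - (1 if step > 0 and step % every == 0 else 0))

ref = 6
for step in (0, 1000, 2000):
    ref = tighten_reference(ref, step)
    print(step, ref, trace_reward(correct=True, rounds_used=5, reference=ref))
```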

What carries the argument

Dual-Grained Efficiency-Aware Reinforcement Learning, which applies a trajectory-level TRACE reward for cost efficiency at the macro scale and on-policy distillation for token-level corrections at the micro scale.
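
The micro-level mechanism is easiest to see as a dense per-token objective. A minimal sketch, assuming a forward-KL formulation against teacher logits on a failed rollout; the paper's exact distillation loss, tensor shapes, and weighting are not given in the abstract, so everything below is an assumption.

```python
# Sketch of the micro-level signal: on failed rollouts, a teacher supplies
# dense per-token targets, here as forward KL between teacher and student
# next-token distributions. Shapes and loss form are assumptions.
import torch
import torch.nn.functional as F

def token_level_distill_loss(student_logits: torch.Tensor,
                             teacher_logits: torch.Tensor) -> torch.Tensor:
    """student_logits, teacher_logits: [seq_len, vocab]."""
    log_p_student = F.log_softmax(student_logits, dim=-1)
    p_teacher = F.softmax(teacher_logits, dim=-1)
    # KL(teacher || student): every token position contributes a gradient,
    # unlike a single sparse trajectory-level outcome reward.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")

seq, vocab = 8, 32
loss = token_level_distill_loss(torch.randn(seq, vocab), torch.randn(seq, vocab))
print(loss.item())
```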

If this is right

  • Multimodal agents can treat parallel dispatch as the default action for queries with independent sub-retrievals.
  • Efficiency metrics must be included in future agent benchmarks because accuracy alone does not capture real deployment cost.
  • Open-source agents can match or exceed closed systems on joint accuracy-efficiency leaderboards.
  • The same dual-grained reward structure could be applied to other tool-use domains that allow concurrent actions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Deployed systems using HyperEyes would see lower cumulative API costs and shorter user wait times when handling multi-entity visual-text queries.
  • The approach suggests that efficiency-aware training could reduce context-length pressure in long agent sessions by trimming unnecessary intermediate steps.
  • Future work could test whether the same pipeline generalizes to agents that combine search with external code execution or database tools.

Load-bearing premise

The Parallel-Amenable Data Synthesis Pipeline and Progressive Rejection Sampling produce trajectories that keep necessary multi-hop reasoning intact while the TRACE reward and on-policy distillation optimize efficiency without adding bias or losing capability.
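
Progressive Rejection Sampling is named but not specified in the abstract. One plausible reading, sketched below with invented trajectory fields and budgets, keeps only correct trajectories under a round budget that tightens across passes, stopping before the filter would empty the pool so that genuinely multi-hop trajectories survive.

```python
# One plausible reading of Progressive Rejection Sampling; the Trajectory
# fields and budget schedule are invented for illustration.
from dataclasses import dataclass

@dataclass
class Trajectory:
    correct: bool
    rounds: int          # tool-call rounds used

def progressive_rejection(pool: list[Trajectory],
                          budgets: tuple[int, ...] = (8, 4, 2)) -> list[Trajectory]:
    kept = [t for t in pool if t.correct]
    for budget in budgets:
        filtered = [t for t in kept if t.rounds <= budget]
        if not filtered:  # stop tightening before losing coverage entirely,
            break         # so necessary multi-hop trajectories are retained
        kept = filtered
    return kept

pool = [Trajectory(True, 1), Trajectory(True, 5), Trajectory(False, 1)]
print(progressive_rejection(pool))  # only the correct, low-round trajectory survives
```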

What would settle it

A controlled test on the IMEB benchmark or similar data showing that accuracy falls below the strongest baseline once the efficiency rewards are removed, or that human raters detect clear losses in multi-hop reasoning quality on the parallel trajectories.

Figures

Figures reproduced from arXiv:2605.07177 by Guankai Li, Jiabin Chen, Xichen Zhang, Yi Xu, Yuan Lu.

  • Figure 1: Comparison between conventional multimodal search agents and HyperEyes.
  • Figure 2: Overview of the HyperEyes training framework, which consists of two main stages.
  • Figure 3: Overview of the IMEB benchmark, including domain distribution.
  • Figure 4: Comparison of three search paradigms under controlled conditions.
  • Figure 5: Comparison between the representative serial agent DeepEyes-V2 and our parallel grounded agent.
read the original abstract

Existing multimodal search agents process target entities sequentially, issuing one tool call per entity and accumulating redundant interaction rounds whenever a query decomposes into independent sub-retrievals. We argue that effective multimodal agents should search wider rather than longer: dispatching multiple grounded queries concurrently within a round. To this end, we present HyperEyes, a parallel multimodal search agent that fuses visual grounding and retrieval into a single atomic action, enabling concurrent search across multiple entities while treating inference efficiency as a first-class training objective. HyperEyes is trained in two stages. For cold-start supervision, we develop a Parallel-Amenable Data Synthesis Pipeline covering visual multi-entity and textual multi-constraint queries, curating efficiency-oriented trajectories via Progressive Rejection Sampling. Building on this, our central contribution, a Dual-Grained Efficiency-Aware Reinforcement Learning framework, operates at two levels. At the macro level, we propose TRACE (Tool-use Reference-Adaptive Cost Efficiency), a trajectory-level reward whose reference is monotonically tightened during training to suppress superfluous tool calls without restricting genuine multi-hop search. At the micro level, we adapt On-Policy Distillation to inject dense token-level corrective signals from an external teacher on failed rollouts, mitigating the credit-assignment deficiency of sparse outcome rewards. Since existing benchmarks evaluate accuracy as the sole metric, omitting inference cost, we introduce IMEB, a human-curated benchmark of 300 instances that jointly evaluates search capability and efficiency. Across six benchmarks, HyperEyes-30B surpasses the strongest comparable open-source agent by 9.9% in accuracy with 5.3x fewer tool-call rounds on average.
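
Since IMEB scores search capability and efficiency jointly but the abstract gives no aggregation formula, the per-instance record and two-number summary below are an assumption about what such a report could look like, not the benchmark's actual scoring code.

```python
# Illustrative joint accuracy/efficiency summary in the spirit of IMEB.
# The per-instance record format and the reporting scheme are assumptions.
from statistics import mean

def summarize(results: list[dict]) -> dict:
    # results: [{"correct": bool, "rounds": int}, ...] per benchmark instance
    return {
        "accuracy": mean(1.0 if r["correct"] else 0.0 for r in results),
        "avg_rounds": mean(r["rounds"] for r in results),
    }

runs = [{"correct": True, "rounds": 1}, {"correct": True, "rounds": 2},
        {"correct": False, "rounds": 4}]
print(summarize(runs))  # accuracy alone would hide the round-count cost
```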

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes HyperEyes, a parallel multimodal search agent designed to perform concurrent searches across multiple entities in a single round rather than through sequential per-entity tool calls. It introduces a two-stage training approach consisting of a Parallel-Amenable Data Synthesis Pipeline with Progressive Rejection Sampling for cold-start supervision, followed by a Dual-Grained Efficiency-Aware Reinforcement Learning framework. This framework includes the TRACE (Tool-use Reference-Adaptive Cost Efficiency) trajectory-level reward with monotonically tightened references and On-Policy Distillation for token-level signals. Additionally, the paper presents the IMEB benchmark for evaluating both search capability and efficiency. The central empirical claim is that the 30B model variant outperforms the strongest open-source baseline by 9.9% in accuracy while using 5.3 times fewer tool-call rounds on average across six benchmarks.

Significance. If the results are confirmed, this work could have substantial impact on the development of efficient multimodal agents by establishing parallel search as a viable strategy and integrating efficiency as a primary optimization objective in RL training. The TRACE reward and IMEB benchmark represent potentially useful contributions to the field, provided they demonstrate robustness beyond the reported settings.

major comments (1)
  1. Abstract: The reported performance improvements (9.9% accuracy gain and 5.3x reduction in tool calls) are central to the paper's contribution but are presented without supporting details on experimental setup, baseline comparisons, ablation studies, or statistical significance. This prevents verification of the claims and assessment of whether the gains are attributable to the proposed methods or other factors.
minor comments (1)
  1. Ensure that all acronyms such as TRACE and IMEB are fully defined at their first occurrence in the manuscript.

Simulated Author's Rebuttal

1 response · 0 unresolved

Thank you for reviewing our manuscript and for your feedback on the presentation of the central empirical claims. We address the major comment below.

read point-by-point responses
  1. Referee: Abstract: The reported performance improvements (9.9% accuracy gain and 5.3x reduction in tool calls) are central to the paper's contribution but are presented without supporting details on experimental setup, baseline comparisons, ablation studies, or statistical significance. This prevents verification of the claims and assessment of whether the gains are attributable to the proposed methods or other factors.

    Authors: We agree that the abstract, constrained by typical length limits, presents the key results at a summary level without the full experimental details. The complete manuscript elaborates the evaluation protocol, the six benchmarks, comparisons against the strongest open-source baselines, ablation studies on the data synthesis pipeline, TRACE reward, and On-Policy Distillation components, as well as statistical reporting via means and standard deviations over repeated runs. The 9.9% accuracy and 5.3x efficiency figures are averages across those benchmarks. To directly address the concern, we will partially revise the abstract by adding a concise clause noting that results are averaged over six benchmarks with efficiency measured in tool-call rounds, while preserving its brevity and directing readers to the main text for verification. revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained in abstract

full rationale

The abstract describes a coherent two-stage process (cold-start via Parallel-Amenable Data Synthesis Pipeline and Progressive Rejection Sampling, followed by TRACE reward at macro level and On-Policy Distillation at micro level) plus a new benchmark IMEB, without providing equations, fitted parameters, or self-citations that reduce any claimed prediction or result to its inputs by construction. All load-bearing elements are presented as novel independent contributions, with no visible self-definitional loops, renamed known results, or ansatzes smuggled via prior work. This is the expected honest non-finding when only high-level text is available and no internal reductions can be exhibited.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

Abstract-only view reveals several introduced components whose internal definitions and assumptions cannot be audited; no explicit free parameters or axioms are stated, but the framework implicitly relies on the validity of the data synthesis and reward shaping.

invented entities (2)
  • TRACE (Tool-use Reference-Adaptive Cost Efficiency) reward no independent evidence
    purpose: Trajectory-level efficiency signal with monotonically tightened reference
    Central macro-level component of the dual-grained RL framework; no independent evidence provided.
  • IMEB benchmark no independent evidence
    purpose: Human-curated 300-instance test set jointly measuring accuracy and tool-call efficiency
    New evaluation resource introduced to address the gap in existing benchmarks; no release details given.

pith-pipeline@v0.9.0 · 5574 in / 1348 out tokens · 33595 ms · 2026-05-12T03:07:43.802228+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · 10 internal anchors

  1. [1]

    PaLM 2 Technical Report

    Anil, R., Dai, A. M., Firat, O., Johnson, M., Lepikhin, D., Passos, A., Shakeri, S., Taropa, E., Bailey, P., Chen, Z., et al. PaLM 2 technical report. arXiv preprint arXiv:2305.10403, 2023. URL https://doi.org/10.48550/arXiv.2305.10403

  2. [2]

    Introducing Claude Opus 4.6, February 2026

    Anthropic. Introducing Claude Opus 4.6, February 2026. URL https://www.anthropic.com/news/claude-opus-4-6. Accessed: 2026-05-07

  3. [3]

    Qwen3-VL Technical Report

    Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631, 2025

  4. [4]

    Language Models are Few-Shot Learners

    Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford,...

  5. [5]

    Redsearcher: A scalable and cost-efficient framework for long-horizon search agents

    Chu, Z., Wang, X., Hong, J., Fan, H., Huang, Y., Yang, Y., Xu, G., Zhao, C., Xiang, C., Hu, S., et al. Redsearcher: A scalable and cost-efficient framework for long-horizon search agents. arXiv preprint arXiv:2602.14234, 2026

  6. [6]

    Gemini 3.1 Pro: A smarter model for your most complex tasks

    DeepMind, G. Gemini 3.1 Pro: A smarter model for your most complex tasks, February 2026. URL https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/. Accessed: 2026-05-07

  7. [7]

    Seeking and updating with live visual knowledge

    Fu, M., Peng, Y., Chen, D., Zhou, Z., Liu, B., Wan, Y., Zhao, Z., Yu, P. S., and Krishna, R. Seeking and updating with live visual knowledge. arXiv preprint arXiv:2504.05288, 2025

  8. [8]

    Webwatcher: Breaking new frontier of vision-language deep research agent

    Geng, X., Xia, P., Zhang, Z., Wang, X., Wang, Q., Ding, R., Wang, C., Wu, J., Zhao, Y., Li, K., et al. Webwatcher: Breaking new frontier of vision-language deep research agent. arXiv preprint arXiv:2508.05748, 2025

  9. [9]

    MiniLLM: Knowledge distillation of large language models

    Gu, Y., Dong, L., Wei, F., and Huang, M. MiniLLM: Knowledge distillation of large language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=5h0qf7IBZZ

  10. [10]

    Deepeyesv2: Toward agentic multimodal model

    Hong, J., Zhao, C., Zhu, C., Lu, W., Xu, G., and XingYu. Deepeyesv2: Toward agentic multimodal model. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=yDKawwfJ5O

  12. [12]

    Vision-deepresearch: Incentivizing deepresearch capability in multimodal large language models

    Huang, W., Zeng, Y., Wang, Q., Fang, Z., Cao, S., Chu, Z., Yin, Q., Chen, S., Yin, Z., Chen, L., et al. Vision-deepresearch: Incentivizing deepresearch capability in multimodal large language models. arXiv preprint arXiv:2601.22060, 2026

  13. [13]

    GPT-4o System Card

    Hurst, A., Lerer, A., Goucher, A. P., Perelman, A., Ramesh, A., Clark, A., Ostrow, A., Welihinda, A., Hayes, A., Radford, A., et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024

  14. [14]

    Mmsearch: Benchmarking the potential of large models as multi-modal search engines

    Jiang, D., Zhang, R., Guo, Z., Wu, Y., Lei, J., Qiu, P., Lu, P., Chen, Z., Fu, C., Song, G., et al. Mmsearch: Benchmarking the potential of large models as multi-modal search engines. arXiv preprint arXiv:2409.12959, 2024

  15. [15]

    Search-r1: Training LLMs to reason and leverage search engines with reinforcement learning

    Jin, B., Zeng, H., Yue, Z., Yoon, J., Arik, S. O., Wang, D., Zamani, H., and Han, J. Search-r1: Training LLMs to reason and leverage search engines with reinforcement learning. In Second Conference on Language Modeling, 2025. URL https://openreview.net/forum?id=Rwhi91ideu

  16. [16]

    Hybrid deep searcher: Scalable parallel and sequential search reasoning

    Ko, D., Kim, J., Park, H., Kim, S., Lee, D., Jo, Y., Kim, G., Lee, M., and Lee, K. Hybrid deep searcher: Scalable parallel and sequential search reasoning. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=rXpTZyucal

  17. [17]

    3d object representations for fine-grained categorization

    Krause, J., Stark, M., Deng, J., and Fei-Fei, L. 3d object representations for fine-grained categorization. In IEEE Workshop on 3D Representation and Recognition, 2013

  18. [18]

    Retrieval-augmented generation for knowledge-intensive NLP tasks

    Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.-t., Rocktäschel, T., et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020

  19. [19]

    W&d: Scaling parallel tool calling for efficient deep research agents

    Lin, X., Liew, J. H., Savarese, S., and Li, J. W&d: Scaling parallel tool calling for efficient deep research agents. arXiv preprint arXiv:2602.07359, 2026

  20. [20]

    Deepdive: Advancing deep search agents with knowledge graphs and multi-turn RL

    Lu, R., Hou, Z., Wang, Z., Zhang, H., Liu, X., Li, Y., Feng, S., Tang, J., and Dong, Y. Deepdive: Advancing deep search agents with knowledge graphs and multi-turn RL. arXiv preprint arXiv:2509.10446, 2025

  21. [21]

    Fine-Grained Visual Classification of Aircraft

    Maji, S., Kannala, J., Rahtu, E., Blaschko, M., and Vedaldi, A. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013

  22. [22]

    Same or not? Enhancing visual perception in vision-language models

    Marsili, D., Mehta, A., Lin, R. Y., and Gkioxari, G. Same or not? Enhancing visual perception in vision-language models. arXiv preprint arXiv:2512.23592, 2025

  23. [23]

    Deepmmsearch-r1: Empowering multimodal LLMs in multimodal web search

    Narayan, K., Xu, Y., Cao, T., Nerella, K., Patel, V. M., Shiee, N., Grasch, P., Jia, C., Yang, Y., and Gan, Z. Deepmmsearch-r1: Empowering multimodal LLMs in multimodal web search. arXiv preprint arXiv:2510.12801, 2025

  24. [24]

    Automated flower classification over a large number of classes

    Nilsback, M.-E. and Zisserman, A. Automated flower classification over a large number of classes. In Proceedings of the Indian Conference on Computer Vision, Graphics and Image Processing, Dec 2008

  25. [25]

    Training language models to follow instructions with human feedback

    Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022

  26. [26]

    Cats and dogs

    Parkhi, O. M., Vedaldi, A., Zisserman, A., and Jawahar, C. V. Cats and dogs. In IEEE Conference on Computer Vision and Pattern Recognition, 2012

  27. [27]

    Measuring and narrowing the compositionality gap in language models

    Press, O., Zhang, M., Min, S., Schmidt, L., Smith, N., and Lewis, M. Measuring and narrowing the compositionality gap in language models. In Bouamor, H., Pino, J., and Bali, K. (eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 5687–5711, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/...

  28. [28]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Reid, M., Savinov, N., Teplyashin, D., Lepikhin, D., Lillicrap, T. P., Alayrac, J., Soricut, R., Lazaridou, A., Firat, O., Schrittwieser, J., Antonoglou, I., Anil, R., Borgeaud, S., Dai, A. M., Millican, K., Dyer, E., Glaese, M., Sottiaux, T., Lee, B., Viola, F., Reynolds, M., Xu, Y., Molloy, J., Chen, J., Isard, M., Barham, P., Hennigan, T., McIlroy, R....

  29. [29]

    Relax: An asynchronous reinforcement learning framework for large-scale agentic models

    Relax Contributors. Relax: An asynchronous reinforcement learning framework for large-scale agentic models. https://github.com/redai-infra/Relax, 2026. Open-source software

  30. [30]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. ArXiv preprint, abs/2402.03300, 2024. URL https://arxiv.org/abs/2402.03300

  31. [31]

    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., and Catanzaro, B. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019

  32. [32]

    Mmsearch-plus: Benchmarking provenance-aware search for multimodal browsing agents

    Tao, X., Teng, Y., Su, X., Fu, X., Wu, J., Tao, C., Liu, Z., Bai, H., Liu, R., and Kong, L. Mmsearch-plus: Benchmarking provenance-aware search for multimodal browsing agents. arXiv preprint arXiv:2508.21475, 2025

  33. [33]

    Kimi K2.5: Visual agentic intelligence

    Team, K., Bai, T., Bai, Y., Bao, Y., Cai, S., Cao, Y., Charles, Y., Che, H., Chen, C., Chen, G., et al. Kimi K2.5: Visual agentic intelligence. arXiv preprint arXiv:2602.02276, 2026

  34. [34]

    MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling

    Team, M., Bai, S., Bing, L., Chen, C., Chen, G., Chen, Y., Chen, Z., Chen, Z., Dai, J., Dong, X., et al. MiroThinker: Pushing the performance boundaries of open-source research agents via model, context, and interactive scaling. arXiv preprint arXiv:2511.11793, 2025

  35. [35]

    INQUIRE: A natural world text-to-image retrieval benchmark

    Vendrow, E., Pantazis, O., Shepard, A., Brostow, G., Jones, K. E., Aodha, O. M., Beery, S., and Horn, G. V. INQUIRE: A natural world text-to-image retrieval benchmark. In The Thirty-Eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024. URL https://openreview.net/forum?id=jbrMS0DNaD

  36. [36]

    Wikidata: A free collaborative knowledgebase

    Vrandečić, D. and Krötzsch, M. Wikidata: a free collaborative knowledgebase. Communications of the ACM, 57(10):78–85, 2014

  37. [37]

    The Caltech-UCSD Birds-200-2011 Dataset

    Wah, C., Branson, S., Welinder, P., Perona, P., and Belongie, S. The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011

  38. [38]

    Google landmarks dataset v2 - a large-scale benchmark for instance-level recognition and retrieval

    Weyand, T., Araujo, A., Cao, B., and Sim, J. Google landmarks dataset v2 - a large-scale benchmark for instance-level recognition and retrieval. In CVPR, 2020. URL https://arxiv.org/abs/2004.01804

  39. [39]

    Mmsearch-r1: Incentivizing LMMs to search

    Wu, J., Deng, Z., Li, W., Liu, Y., You, B., Li, B., Ma, Z., and Liu, Z. Mmsearch-r1: Incentivizing LMMs to search. arXiv preprint arXiv:2506.20670, 2025

  40. [40]

    Open data synthesis for deep research

    Xia, Z., Luo, K., Qian, H., and Liu, Z. Open data synthesis for deep research. arXiv preprint arXiv:2509.00375, 2025

  41. [41]

    ReAct: Synergizing reasoning and acting in language models

    Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K. R., and Cao, Y. ReAct: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=WE_vluYUL-X

  42. [42]

    A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well?

    Zhang, Q., Lyu, F., Sun, Z., Wang, L., Zhang, W., Hua, W., Wu, H., Guo, Z., Wang, Y., Muennighoff, N., et al. A survey on test-time scaling in large language models: What, how, where, and how well? arXiv preprint arXiv:2503.24235, 2025

  43. [43]

    Skywork-r1v4: Toward agentic multimodal intelligence through interleaved thinking with images and deepresearch

    Zhang, Y., Hu, L., Sun, H., Wang, P., Wei, Y., Yin, S., Pei, J., Shen, W., Xia, P., Peng, Y., et al. Skywork-r1v4: Toward agentic multimodal intelligence through interleaved thinking with images and deepresearch. arXiv preprint arXiv:2512.02395, 2025

  44. [44]

    Parallelsearch: Train your LLMs to decompose query and search sub-queries in parallel with reinforcement learning

    Zhao, S., Yu, T., Xu, A., Singh, J., Shukla, A., and Akkiraju, R. Parallelsearch: Train your LLMs to decompose query and search sub-queries in parallel with reinforcement learning. arXiv preprint arXiv:2508.09303, 2025

  45. [45]

    SWIFT: A scalable lightweight infrastructure for fine-tuning

    Zhao, Y., Huang, J., Hu, J., Wang, X., Mao, Y., Zhang, D., Zhang, H., Jiang, Z., Wu, Z., Ai, B., Wang, A., Zhou, W., and Chen, Y. SWIFT: A scalable lightweight infrastructure for fine-tuning. arXiv preprint arXiv:2408.05517, 2024. URL https://arxiv.org/abs/2408.05517

  46. [46]

    SGLang: Efficient execution of structured language model programs

    Zheng, L., Yin, L., Xie, Z., Sun, C., Huang, J., Yu, C. H., Cao, S., Kozyrakis, C., Stoica, I., Gonzalez, J. E., Barrett, C., and Sheng, Y. SGLang: Efficient execution of structured language model programs. In NeurIPS, 2024
