pith. machine review for the scientific record. sign in

arxiv: 2604.14029 · v1 · submitted 2026-04-15 · 💻 cs.CV

Recognition: unknown

POINTS-Seeker: Towards Training a Multimodal Agentic Search Model from Scratch

Jiangchao Yao, Le Tian, Weidi Xie, Xiao Zhou, Yanfeng Wang, Yikun Liu, Yuan Liu

Authors on Pith no claims yet

Pith reviewed 2026-05-10 13:05 UTC · model grok-4.3

classification 💻 cs.CV
keywords multimodal agentic searchAgentic SeedingV-Fold compressionlong-horizon visual reasoningPOINTS-Seekerfrom-scratch trainingevidence retrievalknowledge-intensive tasks
0
0 comments X

The pith

Training a multimodal agentic search model from scratch with Agentic Seeding and V-Fold yields consistent gains over retrofitted models on long-horizon visual reasoning tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to demonstrate that multimodal search capabilities can be built directly into a model during initial training instead of being added later as tools to general large multimodal models. It introduces Agentic Seeding as an early phase to instill the basic habits of active evidence gathering and proposes V-Fold as a way to keep interaction histories manageable by compressing older turns into visual renderings. A sympathetic reader would care because current models lose effectiveness once dialogue histories grow long and they cannot reliably pull in external knowledge for image-based questions. If the approach holds, it opens a path to agents that sustain extended reasoning chains without the fragmentation that modular add-ons often produce.

Core claim

The central claim is that POINTS-Seeker-8B, trained entirely from scratch, reaches state-of-the-art results across six benchmarks by first using Agentic Seeding to embed precursors of agentic behavior and then applying V-Fold to compress expanding interaction histories. V-Fold keeps the most recent dialogue turns at full fidelity while folding earlier context into the visual space through rendering, thereby removing the bottleneck that otherwise overwhelms the model when histories become long. The resulting system therefore handles knowledge-intensive visual reasoning more reliably than models that simply attach search tools to pretrained backbones.

What carries the argument

Agentic Seeding combined with V-Fold: Agentic Seeding is the dedicated early training phase that installs foundational agentic behaviors, while V-Fold is the adaptive compression method that preserves recent turns verbatim and renders historical context as images to reduce token load.

If this is right

  • Longer interaction sequences become feasible without loss of evidence location accuracy.
  • Knowledge-intensive visual reasoning tasks can be solved through repeated external retrieval rather than reliance on parametric memory alone.
  • Specialized agentic behaviors can be instilled without first training a general multimodal foundation model.
  • History compression via visual rendering offers a scalable way to extend context length in multimodal agents.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar seeding and folding techniques could be tested on language-only agentic models to check whether the from-scratch advantage generalizes beyond vision.
  • If V-Fold proves robust, it suggests a broader design pattern for any long-context agent where recent actions stay explicit and older state is summarized in a modality the model already processes well.
  • The results raise the question of whether the data used for Agentic Seeding can be further distilled to reduce overall training cost while retaining the observed gains.

Load-bearing premise

The performance advantages arise specifically from the from-scratch training regimen, Agentic Seeding phase, and V-Fold compression rather than from differences in model scale, training data, or benchmark construction.

What would settle it

A side-by-side comparison in which a similarly sized retrofitted model, trained on equivalent data and evaluated on the same six benchmarks, matches or exceeds POINTS-Seeker-8B accuracy.

Figures

Figures reproduced from arXiv: 2604.14029 by Jiangchao Yao, Le Tian, Weidi Xie, Xiao Zhou, Yanfeng Wang, Yikun Liu, Yuan Liu.

Figure 1
Figure 1. Figure 1: Overview of POINTS-Seeker. Our model adaptively interacts with exter￾nal tools to solve complex, multi-hop VQA tasks. By leveraging V-Fold compression, stale history is rendered into compact visual tokens, effectively bypassing long-context performance degradation while achieving superior efficiency and reasoning fidelity. Currently, the prevailing paradigm [17, 32, 49, 61] for developing multimodal search… view at source ↗
Figure 2
Figure 2. Figure 2: The overall framework of our proposed agentic search system. [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Motivation for history-aware compression. [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Tool usage analysis for POINTS-Seeker across six diverse evaluation [PITH_FULL_IMAGE:figures/full_fig_p013_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Visualization of POINTS-Seeker. Details are simplified for visualization. address this question in a step-by-step manner. First, the model performs ques￾tion decomposition, prioritizing geographic localization before retrieving specific information. It invokes an image search tool to identify the region depicted in the image as Fort Lauderdale. Subsequently, POINTS-Seeker conducts multiple rounds of web se… view at source ↗
Figure 6
Figure 6. Figure 6: Case Study 1: SFT vs. RL. Details are simplified for visualization. Question: Which location of this university in the image experienced the earliest permanent closure following a reduction in operating hours? <think> Based on my search, I can confirm that **Saxbys at 40th and Locust**: Reduced hours (eliminated weekend operations) in April 2025, then closed permanently on May 2, 2025 … </think> <answer> …… view at source ↗
Figure 7
Figure 7. Figure 7: Case Study 2: SFT vs. RL. Details are simplified for visualization. C More Qualitative Results C.1 Case Study [PITH_FULL_IMAGE:figures/full_fig_p027_7.png] view at source ↗
read the original abstract

While Large Multimodal Models (LMMs) demonstrate impressive visual perception, they remain epistemically constrained by their static parametric knowledge. To transcend these boundaries, multimodal search models have been adopted to actively interact with the external environment for evidence retrieval. Diverging from prevailing paradigms that merely retrofit general LMMs with search tools as modular extensions, we explore the potential of building a multimodal agentic search model from scratch. Specifically, we make the following contributions: (i) we introduce Agentic Seeding, a dedicated phase designed to weave the foundational precursors necessary for eliciting agentic behaviors; (ii) we uncover a performance bottleneck in long-horizon interactions, where the increasing volume of interaction history overwhelms the model's ability to locate ground-truth evidence. To mitigate this, we propose V-Fold, an adaptive history-aware compression scheme that preserves recent dialogue turns in high fidelity while folding historical context into the visual space via rendering; and (iii) we develop POINTS-Seeker-8B, a state-of-the-art multimodal agentic search model that consistently outperforms existing models across six diverse benchmarks, effectively resolving the challenges of long-horizon, knowledge-intensive visual reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes POINTS-Seeker-8B, a multimodal agentic search model trained entirely from scratch rather than by retrofitting existing LMMs. It introduces Agentic Seeding as an initial phase to instill foundational agentic behaviors and V-Fold as an adaptive compression technique that renders historical context visually to mitigate overload in long-horizon interactions. The central claim is that this training paradigm yields a model that consistently outperforms prior approaches across six diverse benchmarks for knowledge-intensive visual reasoning.

Significance. If the performance claims are supported by controlled experiments, the work would be significant because it directly tests whether purpose-built from-scratch training for agentic search can surpass the prevailing retrofit paradigm. The V-Fold mechanism addresses a concrete, widely observed failure mode (context overload in extended tool-use trajectories) with a potentially generalizable visual-folding approach. Explicit credit is due for focusing on long-horizon bottlenecks rather than short single-turn retrieval.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (Experiments): The manuscript asserts that POINTS-Seeker-8B 'consistently outperforms existing models across six diverse benchmarks' and 'effectively resolv[es] the challenges of long-horizon, knowledge-intensive visual reasoning,' yet supplies no numerical results, baseline comparisons, or error bars. This is load-bearing for the central claim; without these data the attribution of gains to Agentic Seeding and V-Fold versus differences in scale, data volume, or evaluation protocol cannot be evaluated.
  2. [§3 and §4] §3 (Method) and §4: No ablations are described that isolate Agentic Seeding or V-Fold (e.g., full model vs. variant without Agentic Seeding, vs. variant without V-Fold, or vs. retrofitted LMMs at identical 8B scale and training tokens) under matched search-tool interfaces and benchmark protocols. The stress-test concern therefore lands: the causal claim that the new training phases resolve the long-horizon bottleneck rests on an untested assumption.
minor comments (2)
  1. [§3] The acronym 'V-Fold' is introduced without an explicit expansion or diagram showing the rendering step; a small illustrative figure would improve clarity.
  2. [Abstract] The abstract uses 'parameter-free' phrasing in passing; if any component of V-Fold or the seeding loss contains tunable hyperparameters, this should be stated explicitly.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback emphasizing the need for clearer empirical support. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of results and ablations.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): The manuscript asserts that POINTS-Seeker-8B 'consistently outperforms existing models across six diverse benchmarks' and 'effectively resolv[es] the challenges of long-horizon, knowledge-intensive visual reasoning,' yet supplies no numerical results, baseline comparisons, or error bars. This is load-bearing for the central claim; without these data the attribution of gains to Agentic Seeding and V-Fold versus differences in scale, data volume, or evaluation protocol cannot be evaluated.

    Authors: We agree that the abstract and opening of §4 should explicitly include numerical results, baseline comparisons, and error bars to substantiate the claims. The current draft states the performance outcomes qualitatively without quoting specific metrics or error bars in these sections. We will revise the abstract to report key benchmark scores and average improvements, and add a concise summary paragraph with numerical values, baselines, and error bars at the start of the Experiments section. This will allow direct assessment of gains attributable to our methods versus scale or data factors. revision: yes

  2. Referee: [§3 and §4] §3 (Method) and §4: No ablations are described that isolate Agentic Seeding or V-Fold (e.g., full model vs. variant without Agentic Seeding, vs. variant without V-Fold, or vs. retrofitted LMMs at identical 8B scale and training tokens) under matched search-tool interfaces and benchmark protocols. The stress-test concern therefore lands: the causal claim that the new training phases resolve the long-horizon bottleneck rests on an untested assumption.

    Authors: We acknowledge that the manuscript does not currently include ablations isolating Agentic Seeding or V-Fold. To address this, we will add a dedicated ablation subsection in the revised §4. This will report results for the full model versus variants without Agentic Seeding, without V-Fold (replacing it with standard text history), and versus an 8B-scale retrofitted LMM baseline, all under matched tool interfaces, data volume, and benchmark protocols. While these require additional compute, we will include them to directly test the causal contributions to long-horizon performance. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical training claims rest on experimental results, not self-referential derivations

full rationale

The manuscript describes an empirical training pipeline for POINTS-Seeker-8B using two new techniques (Agentic Seeding and V-Fold) and reports benchmark outperformance. No equations, fitted parameters, or first-principles derivations appear in the provided text. The central claims are supported by experimental comparisons rather than any reduction of outputs to inputs by construction, self-citation chains, or renamed known results. This is a standard empirical ML paper whose validity hinges on ablation quality and baseline fairness, not on logical circularity in a derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the unstated assumptions that the six benchmarks are representative of long-horizon visual reasoning, that the training data and compute are comparable to prior work, and that V-Fold actually preserves necessary evidence rather than introducing artifacts. No free parameters, axioms, or invented entities are explicitly listed in the abstract.

pith-pipeline@v0.9.0 · 5520 in / 1192 out tokens · 40211 ms · 2026-05-10T13:05:40.338701+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. TRACER: Verifiable Generative Provenance for Multimodal Tool-Using Agents

    cs.CL 2026-05 unverdicted novelty 7.0

    TRACER attaches verifiable sentence-level provenance records to multimodal agent outputs using tool-turn alignment and semantic relations, yielding 78.23% answer accuracy and fewer tool calls than baselines on TRACE-Bench.

Reference graph

Works this paper leans on

96 extracted references · 45 canonical work pages · cited by 1 Pith paper · 15 internal anchors

  1. [1]

    Qwen3-VL Technical Report

    Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-vl technical report. arXiv preprint arXiv:2511.21631 (2025) 1

  2. [2]

    In: EMNLP (2022) 3

    Chen, W., Hu, H., Chen, X., Verga, P., Cohen, W.: Murag: Multimodal retrieval- augmented generator for open question answering over images and text. In: EMNLP (2022) 3

  3. [3]

    arXiv preprint arXiv:2407.21439 , year=

    Chen, Z., Xu, C., Qi, Y., Guo, J.: Mllm is a strong reranker: Advancing mul- timodal retrieval-augmented generation via knowledge-enhanced reranking and noise-injected training. arXiv preprint arXiv:2407.21439 (2024) 3

  4. [4]

    arXiv preprint arXiv:2510.17800 (2025) 11

    Cheng, J., Liu, Y., Zhang, X., Fei, Y., Hong, W., Lyu, R., Wang, W., Su, Z., Gu, X., Liu, X., et al.: Glyph: Scaling context windows via visual-text compression. arXiv preprint arXiv:2510.17800 (2025) 4, 9, 21

  5. [5]

    In: ICCV (2025) 9

    Cheng, X., Zhang, W., Zhang, S., Yang, J., Guan, X., Wu, X., Li, X., Zhang, G., Liu, J., Mai, Y., et al.: Simplevqa: Multimodal factuality evaluation for multimodal large language models. In: ICCV (2025) 9

  6. [6]

    Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

    Chhikara, P., Khant, D., Aryan, S., Singh, T., Yadav, D.: Mem0: Building production-ready ai agents with scalable long-term memory. arXiv preprint arXiv:2504.19413 (2025) 4

  7. [7]

    Sensenova-mars: Empowering multimodal agentic reasoning and search via reinforcement learning.arXiv preprint arXiv:2512.24330, 2025

    Chng, Y.X., Hu, T., Tong, W., Li, X., Chen, J., Yu, H., Lu, J., Guo, H., Deng, H., Xie, C., Huang, G., Lin, D., Lu, L.: Sensenova-mars: Empowering multi- modal agentic reasoning and search via reinforcement learning. arXiv preprint arXiv:2512.24330 (2025) 11

  8. [8]

    Redsearcher: A scalable and cost-efficient framework for long-horizon search agents.arXiv preprint arXiv:2602.14234, 2026

    Chu, Z., Wang, X., Hong, J., Fan, H., Huang, Y., Yang, Y., Xu, G., Hu, S., Kuang, D., Zhao, C., Xiang, C., Liu, M., Qin, B., Yu, X.: Redsearcher: A scalable and cost- efficient framework for long-horizon search agents. arXiv preprint arXiv:2602.14234 (2026),https://arxiv.org/pdf/2602.142343

  9. [9]

    PaddleOCR 3.0 Technical Report

    Cui, C., Sun, T., Lin, M., Gao, T., Zhang, Y., Liu, J., Wang, X., Zhang, Z., Zhou, C., Liu, H., et al.: Paddleocr 3.0 technical report. arXiv preprint arXiv:2507.05595 (2025) 19

  10. [10]

    and Yang, F

    Feng, L., Yang, F., Chen, F., Cheng, X., Xu, H., Wan, Z., Yan, M., An, B.: Agentocr: Reimagining agent history via optical self-compression. arXiv preprint arXiv:2601.04786 (2026) 4, 9

  11. [11]

    S., and Krishna, R

    Fu, M., Peng, Y., Chen, D., Zhou, Z., Liu, B., Wan, Y., Zhao, Z., Yu, P.S., Krishna, R.: Seeking and updating with live visual knowledge. arXiv preprint arXiv:2504.05288 (2025) 6, 9

  12. [12]

    Webwatcher: Breaking new frontier of vision-language deep research agent.arXiv preprint arXiv:2508.05748, 2025

    Geng, X., Xia, P., Zhang, Z., Wang, X., Wang, Q., Ding, R., Wang, C., Wu, J., Zhao, Y., Li, K., et al.: Webwatcher: Breaking new frontier of vision-language deep research agent. arXiv preprint arXiv:2508.05748 (2025) 3, 9, 11

  13. [13]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Guo, D., Yang, D., Zhang, H., Song, J., Wang, P., Zhu, Q., Xu, R., Zhang, R., Ma, S., Bi, X., et al.: Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025) 7

  14. [14]

    Seed1.5-VL Technical Report

    Guo, D., Wu, F., Zhu, F., Leng, F., Shi, G., Chen, H., Fan, H., Wang, J., Jiang, J., Wang, J., et al.: Seed1. 5-vl technical report. arXiv preprint arXiv:2505.07062 (2025) 1

  15. [15]

    Deepeyesv2: Toward agentic multimodal model

    Hong, J., Zhao, C., Zhu, C., Lu, W., Xu, G., Yu, X.: Deepeyesv2: Toward agentic multimodal model. arXiv preprint arXiv:2511.05271 (2025) 11

  16. [16]

    In: CVPR (2023) 3 16 F

    Hu, Z., Iscen, A., Sun, C., Wang, Z., Chang, K.W., Sun, Y., Schmid, C., Ross, D.A., Fathi, A.: Reveal: Retrieval-augmented visual-language pre-training with multi-source multimodal knowledge memory. In: CVPR (2023) 3 16 F. Author et al

  17. [17]

    Vision-deepresearch: Incentivizing deepresearch capability in multimodal large language models.arXiv preprint arXiv:2601.22060, 2026

    Huang, W., Zeng, Y., Wang, Q., Fang, Z., Cao, S., Chu, Z., Yin, Q., Chen, S., Yin, Z., Chen, L., et al.: Vision-deepresearch: Incentivizing deepresearch capability in multimodal large language models. arXiv preprint arXiv:2601.22060 (2026) 2, 3, 10, 11

  18. [18]

    In: EMNLP (2025) 9

    Jia, Y., Li, J., Yue, X., Li, B., Nie, P., Zou, K., Chen, W.: Visualwebinstruct: Scaling up multimodal instruction data through web search. In: EMNLP (2025) 9

  19. [19]

    In: ICLR (2025) 9

    Jiang, D., Zhang, R., Guo, Z., Wu, Y., Qiu, P., Lu, P., Chen, Z., Song, G., Gao, P., Liu, Y., et al.: Mmsearch: Unveiling the potential of large models as multi-modal search engines. In: ICLR (2025) 9

  20. [20]

    In: EMNLP (2020) 3

    Karpukhin, V., Oguz, B., Min, S., Lewis, P., Wu, L., Edunov, S., Chen, D., Yih, W.t.: Dense passage retrieval for open-domain question answering. In: EMNLP (2020) 3

  21. [21]

    In: NIPS (2020) 3

    Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.t., Rocktäschel, T., et al.: Retrieval-augmented generation for knowledge-intensive nlp tasks. In: NIPS (2020) 3

  22. [22]

    LLaVA-OneVision: Easy Visual Task Transfer

    Li, B., Zhang, Y., Guo, D., Zhang, R., Li, F., Zhang, H., Zhang, K., Zhang, P., Li, Y., Liu, Z., et al.: Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326 (2024) 5

  23. [23]

    DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

    Liu, A., Mei, A., Lin, B., Xue, B., Wang, B., Xu, B., Wu, B., Zhang, B., Lin, C., Dong, C., et al.: Deepseek-v3. 2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556 (2025) 2

  24. [24]

    In: CVPR (2024) 10

    Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning. In: CVPR (2024) 10

  25. [25]

    In: ACL (2024) 2

    Liu, N.F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., Liang, P.: Lost in the middle: How language models use long contexts. In: ACL (2024) 2

  26. [26]

    In: CVPR (2025) 3

    Liu, Y., Zhang, Y., Cai, J., Jiang, X., Hu, Y., Yao, J., Wang, Y., Xie, W.: Lamra: Large multimodal model as your advanced retrieval assistant. In: CVPR (2025) 3

  27. [27]

    5: Building a vision-language model towards real world applications

    Liu, Y., Tian, L., Zhou, X., Gao, X., Yu, K., Yu, Y., Zhou, J.: Points1. 5: Building a vision-language model towards real world applications. arXiv preprint arXiv:2412.08443 (2024) 1

  28. [28]

    In: EMNLP (2025) 1, 19

    Liu, Y., Zhao, Z., Tian, L., Wang, H., Ye, X., You, Y., Yu, Z., Wu, C., Xiao, Z., Yu, Y., et al.: Points-reader: Distillation-free adaptation of vision-language models for document conversion. In: EMNLP (2025) 1, 19

  29. [29]

    Visual agentic reinforcement fine-tuning

    Liu,Z.,Zang,Y.,Zou,Y.,Liang,Z.,Dong,X.,Cao,Y.,Duan,H.,Lin,D.,Wang,J.: Visual agentic reinforcement fine-tuning. arXiv preprint arXiv:2505.14246 (2025) 1

  30. [30]

    In: ICCV (2023) 3

    Mensink, T., Uijlings, J., Castrejon, L., Goel, A., Cadar, F., Zhou, H., Sha, F., Araujo, A., Ferrari, V.: Encyclopedic vqa: Visual questions about detailed proper- ties of fine-grained categories. In: ICCV (2023) 3

  31. [31]

    OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation

    Nan, K., Xie, R., Zhou, P., Fan, T., Yang, Z., Chen, Z., Li, X., Yang, J., Tai, Y.: Openvid-1m: A large-scale high-quality dataset for text-to-video generation. arXiv preprint arXiv:2407.02371 (2024) 19

  32. [32]

    M., Shiee, N., Grasch, P., Jia, C., Yang, Y ., and Gan, Z

    Narayan, K., Xu, Y., Cao, T., Nerella, K., Patel, V.M., Shiee, N., Grasch, P., Jia, C., Yang, Y., Gan, Z.: Deepmmsearch-r1: Empowering multimodal llms in multimodal web search. arXiv preprint arXiv:2510.12801 (2025) 1, 2, 4, 11

  33. [33]

    In: NIPS (2022) 19 Abbreviated paper title 17

    Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., et al.: Laion-5b: An open large-scale dataset for training next generation image-text models. In: NIPS (2022) 19 Abbreviated paper title 17

  34. [34]

    HybridFlow: A Flexible and Efficient RLHF Framework

    Sheng, G., Zhang, C., Ye, Z., Wu, X., Zhang, W., Zhang, R., Peng, Y., Lin, H., Wu, C.: Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv: 2409.19256 (2024) 20

  35. [35]

    arXiv preprint arXiv:2601.21468 , year=

    Shi, Y., Liu, S., Yang, Y., Mao, W., Chen, Y., Gu, Q., Su, H., Cai, X., Wang, X., Zhang, A.: Memocr: Layout-aware visual memory for efficient long-horizon reasoning. arXiv preprint arXiv:2601.21468 (2026) 4, 9

  36. [36]

    arXiv preprint arXiv:2602.05829 (2026) 3

    Shi, Y., Di, S., Chen, Q., Wang, Q., Cai, J., Jiang, X., Hu, Y., Xie, W.: Weaver: End-to-end agentic system training for video interleaved reasoning. arXiv preprint arXiv:2602.05829 (2026) 3

  37. [37]

    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., Catanzaro, B.: Megatron-lm: Training multi-billion parameter language models using model par- allelism. arXiv preprint arXiv:1909.08053 (2019) 20

  38. [38]

    Neurocomputing (2024) 21

    Su, J., Ahmed, M., Lu, Y., Pan, S., Bo, W., Liu, Y.: Roformer: Enhanced trans- former with rotary position embedding. Neurocomputing (2024) 21

  39. [39]

    arXiv e-prints pp

    Tan, H., Ji, Y., Hao, X., Lin, M., Wang, P., Wang, Z., Zhang, S.: Reason-rft: Re- inforcement fine-tuning for visual reasoning. arXiv e-prints pp. arXiv–2503 (2025) 9

  40. [40]

    Mmsearch-plus: Benchmarking provenance-aware search for multimodal browsing agents.arXiv preprint arXiv:2508.21475, 2025

    Tao, X., Teng, Y., Su, X., Fu, X., Wu, J., Tao, C., Liu, Z., Bai, H., Liu, R., Kong, L.: Mmsearch-plus: Benchmarking provenance-aware search for multimodal browsing agents. arXiv preprint arXiv:2508.21475 (2025) 9

  41. [41]

    Kimi-VL Technical Report

    Team, K., Du, A., Yin, B., Xing, B., Qu, B., Wang, B., Chen, C., Zhang, C., Du, C., Wei, C., et al.: Kimi-vl technical report. arXiv preprint arXiv:2504.07491 (2025) 1

  42. [42]

    arXiv preprint arXiv:2411.10557 (2024) 5

    Tu, J., Ni, Z., Crispino, N., Yu, Z., Bendersky, M., Gunel, B., Jia, R., Liu, X., Lyu, L., Song, D., et al.: Mlan: Language-based instruction tuning preserves and trans- fers knowledge in multimodal language models. arXiv preprint arXiv:2411.10557 (2024) 5

  43. [43]

    UncleCode: Crawl4ai: Open-source llm friendly web crawler & scraper.https: //github.com/unclecode/crawl4ai(2024) 10

  44. [44]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., et al.: Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191 (2024) 10, 21

  45. [45]

    arXiv preprint arXiv:2505.22019 (2025)

    Wang, Q., Ding, R., Zeng, Y., Chen, Z., Chen, L., Wang, S., Xie, P., Huang, F., Zhao, F.: Vrag-rl: Empower vision-perception-based rag for visually rich infor- mation understanding via iterative reasoning with reinforcement learning. arXiv preprint arXiv:2505.22019 (2025) 4

  46. [46]

    URL https://openreview.net/forum?id=ehfRiF0R3a

    Wang, Y., Takanobu, R., Liang, Z., Mao, Y., Hu, Y., McAuley, J., Wu, X.: Mem-{\alpha}: Learning memory construction via reinforcement learning. arXiv preprint arXiv:2509.25911 (2025) 4

  47. [47]

    DeepSeek-OCR: Contexts Optical Compression

    Wei, H., Sun, Y., Li, Y.: Deepseek-ocr: Contexts optical compression. arXiv preprint arXiv:2510.18234 (2025) 4

  48. [48]

    Open vision reasoner: Transferring linguis- tic cognitive behavior for visual reasoning.arXiv preprint arXiv:2507.05255, 2025

    Wei, Y., Zhao, L., Sun, J., Lin, K., Yin, J., Hu, J., Zhang, Y., Yu, E., Lv, H., Weng, Z., et al.: Open vision reasoner: Transferring linguistic cognitive behavior for visual reasoning. arXiv preprint arXiv:2507.05255 (2025) 5, 20

  49. [49]

    Mmsearch-r1: Incentivizing lmms to search.arXiv preprint arXiv:2506.20670, 2025

    Wu, J., Deng, Z., Li, W., Liu, Y., You, B., Li, B., Ma, Z., Liu, Z.: Mmsearch-r1: Incentivizing lmms to search. arXiv preprint arXiv:2506.20670 (2025) 1, 2, 4, 6, 9, 11

  50. [50]

    Open data synthesis for deep research.arXiv preprint arXiv:2509.00375, 2025

    Xia, Z., Luo, K., Qian, H., Liu, Z.: Open data synthesis for deep research. arXiv preprint arXiv:2509.00375 (2025) 6, 20 18 F. Author et al

  51. [51]

    arXiv preprint arXiv:2510.01179 (2025) 9

    Xu, Z., Soria, A.M., Tan, S., Roy, A., Agrawal, A.S., Poovendran, R., Panda, R.: Toucan: Synthesizing 1.5 m tool-agentic data from real-world mcp environments. arXiv preprint arXiv:2510.01179 (2025) 9

  52. [52]

    Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning

    Yan, S., Yang, X., Huang, Z., Nie, E., Ding, Z., Li, Z., Ma, X., Bi, J., Kersting, K., Pan, J.Z., et al.: Memory-r1: Enhancing large language model agents to manage and utilize memories via reinforcement learning. arXiv preprint arXiv:2508.19828 (2025) 4

  53. [53]

    In: Findings of EMNLP (2024) 3

    Yan, Y., Xie, W.: Echosight: Advancing visual-language models with wiki knowl- edge. In: Findings of EMNLP (2024) 3

  54. [54]

    Qwen3 Technical Report

    Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al.: Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025) 10, 21

  55. [55]

    Mm-deepresearch: A simple and effective multimodal agentic search baseline.arXiv preprint arXiv:2603.01050, 2026

    Yao, H., Yin, Q., Yang, M., Zhao, Z., Wang, Y., Luo, H., Zhang, J., Huang, J.: Mm-deepresearch: A simple and effective multimodal agentic search baseline. arXiv preprint arXiv:2603.01050 (2026) 1, 11

  56. [56]

    In: ICLR (2022) 2, 5

    Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K.R., Cao, Y.: React: Synergizing reasoning and acting in language models. In: ICLR (2022) 2, 5

  57. [57]

    Agentfold: Long-horizon web agents with proactive context management.arXiv preprint arXiv:2510.24699, 2025

    Ye, R., Zhang, Z., Li, K., Yin, H., Tao, Z., Zhao, Y., Su, L., Zhang, L., Qiao, Z., Wang, X., et al.: Agentfold: Long-horizon web agents with proactive context management. arXiv preprint arXiv:2510.24699 (2025) 4

  58. [58]

    arXiv preprint arXiv:2507.02259 , year=

    Yu, H., Chen, T., Feng, J., Chen, J., Dai, W., Yu, Q., Zhang, Y.Q., Ma, W.Y., Liu, J., Wang, M., et al.: Memagent: Reshaping long-context llm with multi-conv rl-based memory agent. arXiv preprint arXiv:2507.02259 (2025) 2, 4

  59. [59]

    Visrag: Vision-based retrieval-augmented generation on multi-modality documents

    Yu, S., Tang, C., Xu, B., Cui, J., Ran, J., Yan, Y., Liu, Z., Wang, S., Han, X., Liu, Z., et al.: Visrag: Vision-based retrieval-augmented generation on multi-modality documents. arXiv preprint arXiv:2410.10594 (2024) 3

  60. [60]

    Reconstruct Picture i with the highest possible fidelity

    Zhang, Y., Ni, B., Chen, X.S., Zhang, H.R., Rao, Y., Peng, H., Lu, Q., Hu, H., Guo, M.H., Hu, S.M.: Bee: A high-quality corpus and full-stack suite to unlock advanced fully open mllms. arXiv preprint arXiv:2510.13795 (2025) 9

  61. [61]

    Skywork-r1v4: Toward agentic multimodal intelligence through interleaved thinking with images and deepresearch.arXiv preprint arXiv:2512.02395, 2025

    Zhang, Y., Hu, L., Sun, H., Wang, P., Wei, Y., Yin, S., Pei, J., Shen, W., Xia, P., Peng, Y., et al.: Skywork-r1v4: Toward agentic multimodal intelligence through in- terleaved thinking with images and deepresearch. arXiv preprint arXiv:2512.02395 (2025) 2, 11

  62. [62]

    1.4 million open-source distilled reasoning dataset to empower large language model training.arXiv preprint arXiv:2503.19633, 2025

    Zhao, H., Wang, H., Peng, Y., Zhao, S., Tian, X., Chen, S., Ji, Y., Li, X.: 1.4 million open-source distilled reasoning dataset to empower large language model training. arXiv preprint arXiv:2503.19633 (2025) 9

  63. [63]

    In: NIPS (2024) 20

    Zheng, L., Yin, L., Xie, Z., Sun, C., Huang, J., Yu, C.H., Cao, S., Kozyrakis, C., Stoica, I., Gonzalez, J.E., et al.: Sglang: Efficient execution of structured language model programs. In: NIPS (2024) 20

  64. [64]

    In: EMNLP (2025) 3

    Zheng, Y., Fu, D., Hu, X., Cai, X., Ye, L., Lu, P., Liu, P.: Deepresearcher: Scaling deep research via reinforcement learning in real-world environments. In: EMNLP (2025) 3

  65. [65]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Zhu, J., Wang, W., Chen, Z., Liu, Z., Ye, S., Gu, L., Tian, H., Duan, Y., Su, W., Shao, J., et al.: Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479 (2025) 19 Abbreviated paper title 19 POINTS-Seeker: Towards Training a Multimodal Agentic Search Model from Scratch Supplementary...

  66. [66]

    In this tag, you need to conduct a step-by-step reasoning process about the image and question and evaluate whether tool use would be helpful and give the reason

    In each turn, you should start with <think> tag. In this tag, you need to conduct a step-by-step reasoning process about the image and question and evaluate whether tool use would be helpful and give the reason. If received tool results, you also need to analyze them

  67. [67]

    **Remember: You can only make one single tool call per turn.**

    If you think some tools are useful, call them in <tool_call> tag. **Remember: You can only make one single tool call per turn.**

  68. [68]

    type": "function

    If you think no more tools are needed, you can answer in <answer> tag. You need to provide a concise summary of your reasoning process that leads to the final answer. Besides, you also need to put a simple and direct answer in \boxed{{}} for verification. The structure of your response should be like this: <think> thinking process satisfying the Instructi...

  69. [69]

    TARGET ENTITY != ANSWER: The entity you select to mask (the Target Entity) MUST NOT be the same as the Answer

  70. [70]

    VISUAL ANCHORING: Select an entity from the QUESTION that serves as a visual anchor (e.g., a location, a related object, or the subject’s environment)

  71. [71]

    MAINTAIN COMPLEXITY: Keep all the complex reasoning steps from the original question

  72. [72]

    this location

    MASKING: Replace the selected Target Entity with a visual reference token (e.g., "this location", "this botanical garden", " the habitat shown here"). ### OUTPUT FORMAT (JSON ONLY): { "status": "success" | "discard", "selected_target_entity": "The entity chosen from the QUESTION ( must not be the answer)", "transformed_vqa_query": "The rewritten complex q...

  73. [73]

    **Semantic Equivalence**: Respond "yes" if the core meaning, facts, and conclusions are the same, even if wording, syntax, or language differs

  74. [74]

    **Detail Level**: If the Model Output is more detailed than the Ground Truth but maintains the same core fact, it is consistent (" yes")

  75. [75]

    **Fact Conflict**: Respond "no" if there is a contradiction, a wrong numerical value, or a different entity mentioned

  76. [76]

    - **Ground Truth**:

    **Format Neutrality**: Ignore casing, punctuation, and minor spelling errors. ### Example 1 - **Question**: "What official residence was directed to be built by a colonial Governor of Dominica and whose gardens were damaged by hurricanes in 1916, 1930, and 1979? Note: The residence was built near the location shown in the image." - **Ground Truth**: "Gove...

  77. [77]

    LeBron James to be Team USA’s male flag bearer for Olympic

    The fourth result specifically states "LeBron James to be Team USA’s male flag bearer for Olympic" 3. The fifth result explicitly states "the basketball player wearing the number 6 jersey for the USA team is LeBron James" The image shows two players in USA uniforms - one wearing number 6 (LeBron James) and another wearing number 7 (Kevin Durant, as visibl...

  78. [78]

    The Cantor set is constructed by recursively removing the middle third from a line segment

  79. [79]

    binary fractal tree,

    The Hausdorff dimension of the Cantor set is log2/log3, which equals approximately 0.6309... However, I need to verify if the image actually shows the Cantor set or a different fractal. The image search initially identified it as a "binary fractal tree," but the question specifically mentions "removing the middle third from a line segment." Let me search ...

  80. [80]

    The question might be misdescribing the figure - it could actually be a Cantor set or Cantor dust

Showing first 80 references.