Recognition: unknown
POINTS-Seeker: Towards Training a Multimodal Agentic Search Model from Scratch
Pith reviewed 2026-05-10 13:05 UTC · model grok-4.3
The pith
Training a multimodal agentic search model from scratch with Agentic Seeding and V-Fold yields consistent gains over retrofitted models on long-horizon visual reasoning tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that POINTS-Seeker-8B, trained entirely from scratch, reaches state-of-the-art results across six benchmarks by first using Agentic Seeding to embed precursors of agentic behavior and then applying V-Fold to compress expanding interaction histories. V-Fold keeps the most recent dialogue turns at full fidelity while folding earlier context into the visual space through rendering, thereby removing the bottleneck that otherwise overwhelms the model when histories become long. The resulting system therefore handles knowledge-intensive visual reasoning more reliably than models that simply attach search tools to pretrained backbones.
What carries the argument
Agentic Seeding combined with V-Fold: Agentic Seeding is the dedicated early training phase that installs foundational agentic behaviors, while V-Fold is the adaptive compression method that preserves recent turns verbatim and renders historical context as images to reduce token load.
If this is right
- Longer interaction sequences become feasible without loss of evidence location accuracy.
- Knowledge-intensive visual reasoning tasks can be solved through repeated external retrieval rather than reliance on parametric memory alone.
- Specialized agentic behaviors can be instilled without first training a general multimodal foundation model.
- History compression via visual rendering offers a scalable way to extend context length in multimodal agents.
Where Pith is reading between the lines
- Similar seeding and folding techniques could be tested on language-only agentic models to check whether the from-scratch advantage generalizes beyond vision.
- If V-Fold proves robust, it suggests a broader design pattern for any long-context agent where recent actions stay explicit and older state is summarized in a modality the model already processes well.
- The results raise the question of whether the data used for Agentic Seeding can be further distilled to reduce overall training cost while retaining the observed gains.
Load-bearing premise
The performance advantages arise specifically from the from-scratch training regimen, Agentic Seeding phase, and V-Fold compression rather than from differences in model scale, training data, or benchmark construction.
What would settle it
A side-by-side comparison in which a similarly sized retrofitted model, trained on equivalent data and evaluated on the same six benchmarks, matches or exceeds POINTS-Seeker-8B accuracy.
Figures
read the original abstract
While Large Multimodal Models (LMMs) demonstrate impressive visual perception, they remain epistemically constrained by their static parametric knowledge. To transcend these boundaries, multimodal search models have been adopted to actively interact with the external environment for evidence retrieval. Diverging from prevailing paradigms that merely retrofit general LMMs with search tools as modular extensions, we explore the potential of building a multimodal agentic search model from scratch. Specifically, we make the following contributions: (i) we introduce Agentic Seeding, a dedicated phase designed to weave the foundational precursors necessary for eliciting agentic behaviors; (ii) we uncover a performance bottleneck in long-horizon interactions, where the increasing volume of interaction history overwhelms the model's ability to locate ground-truth evidence. To mitigate this, we propose V-Fold, an adaptive history-aware compression scheme that preserves recent dialogue turns in high fidelity while folding historical context into the visual space via rendering; and (iii) we develop POINTS-Seeker-8B, a state-of-the-art multimodal agentic search model that consistently outperforms existing models across six diverse benchmarks, effectively resolving the challenges of long-horizon, knowledge-intensive visual reasoning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes POINTS-Seeker-8B, a multimodal agentic search model trained entirely from scratch rather than by retrofitting existing LMMs. It introduces Agentic Seeding as an initial phase to instill foundational agentic behaviors and V-Fold as an adaptive compression technique that renders historical context visually to mitigate overload in long-horizon interactions. The central claim is that this training paradigm yields a model that consistently outperforms prior approaches across six diverse benchmarks for knowledge-intensive visual reasoning.
Significance. If the performance claims are supported by controlled experiments, the work would be significant because it directly tests whether purpose-built from-scratch training for agentic search can surpass the prevailing retrofit paradigm. The V-Fold mechanism addresses a concrete, widely observed failure mode (context overload in extended tool-use trajectories) with a potentially generalizable visual-folding approach. Explicit credit is due for focusing on long-horizon bottlenecks rather than short single-turn retrieval.
major comments (2)
- [Abstract and §4] Abstract and §4 (Experiments): The manuscript asserts that POINTS-Seeker-8B 'consistently outperforms existing models across six diverse benchmarks' and 'effectively resolv[es] the challenges of long-horizon, knowledge-intensive visual reasoning,' yet supplies no numerical results, baseline comparisons, or error bars. This is load-bearing for the central claim; without these data the attribution of gains to Agentic Seeding and V-Fold versus differences in scale, data volume, or evaluation protocol cannot be evaluated.
- [§3 and §4] §3 (Method) and §4: No ablations are described that isolate Agentic Seeding or V-Fold (e.g., full model vs. variant without Agentic Seeding, vs. variant without V-Fold, or vs. retrofitted LMMs at identical 8B scale and training tokens) under matched search-tool interfaces and benchmark protocols. The stress-test concern therefore lands: the causal claim that the new training phases resolve the long-horizon bottleneck rests on an untested assumption.
minor comments (2)
- [§3] The acronym 'V-Fold' is introduced without an explicit expansion or diagram showing the rendering step; a small illustrative figure would improve clarity.
- [Abstract] The abstract uses 'parameter-free' phrasing in passing; if any component of V-Fold or the seeding loss contains tunable hyperparameters, this should be stated explicitly.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback emphasizing the need for clearer empirical support. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of results and ablations.
read point-by-point responses
-
Referee: [Abstract and §4] Abstract and §4 (Experiments): The manuscript asserts that POINTS-Seeker-8B 'consistently outperforms existing models across six diverse benchmarks' and 'effectively resolv[es] the challenges of long-horizon, knowledge-intensive visual reasoning,' yet supplies no numerical results, baseline comparisons, or error bars. This is load-bearing for the central claim; without these data the attribution of gains to Agentic Seeding and V-Fold versus differences in scale, data volume, or evaluation protocol cannot be evaluated.
Authors: We agree that the abstract and opening of §4 should explicitly include numerical results, baseline comparisons, and error bars to substantiate the claims. The current draft states the performance outcomes qualitatively without quoting specific metrics or error bars in these sections. We will revise the abstract to report key benchmark scores and average improvements, and add a concise summary paragraph with numerical values, baselines, and error bars at the start of the Experiments section. This will allow direct assessment of gains attributable to our methods versus scale or data factors. revision: yes
-
Referee: [§3 and §4] §3 (Method) and §4: No ablations are described that isolate Agentic Seeding or V-Fold (e.g., full model vs. variant without Agentic Seeding, vs. variant without V-Fold, or vs. retrofitted LMMs at identical 8B scale and training tokens) under matched search-tool interfaces and benchmark protocols. The stress-test concern therefore lands: the causal claim that the new training phases resolve the long-horizon bottleneck rests on an untested assumption.
Authors: We acknowledge that the manuscript does not currently include ablations isolating Agentic Seeding or V-Fold. To address this, we will add a dedicated ablation subsection in the revised §4. This will report results for the full model versus variants without Agentic Seeding, without V-Fold (replacing it with standard text history), and versus an 8B-scale retrofitted LMM baseline, all under matched tool interfaces, data volume, and benchmark protocols. While these require additional compute, we will include them to directly test the causal contributions to long-horizon performance. revision: yes
Circularity Check
No circularity: empirical training claims rest on experimental results, not self-referential derivations
full rationale
The manuscript describes an empirical training pipeline for POINTS-Seeker-8B using two new techniques (Agentic Seeding and V-Fold) and reports benchmark outperformance. No equations, fitted parameters, or first-principles derivations appear in the provided text. The central claims are supported by experimental comparisons rather than any reduction of outputs to inputs by construction, self-citation chains, or renamed known results. This is a standard empirical ML paper whose validity hinges on ablation quality and baseline fairness, not on logical circularity in a derivation chain.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 1 Pith paper
-
TRACER: Verifiable Generative Provenance for Multimodal Tool-Using Agents
TRACER attaches verifiable sentence-level provenance records to multimodal agent outputs using tool-turn alignment and semantic relations, yielding 78.23% answer accuracy and fewer tool calls than baselines on TRACE-Bench.
Reference graph
Works this paper leans on
-
[1]
Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-vl technical report. arXiv preprint arXiv:2511.21631 (2025) 1
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[2]
In: EMNLP (2022) 3
Chen, W., Hu, H., Chen, X., Verga, P., Cohen, W.: Murag: Multimodal retrieval- augmented generator for open question answering over images and text. In: EMNLP (2022) 3
2022
-
[3]
arXiv preprint arXiv:2407.21439 , year=
Chen, Z., Xu, C., Qi, Y., Guo, J.: Mllm is a strong reranker: Advancing mul- timodal retrieval-augmented generation via knowledge-enhanced reranking and noise-injected training. arXiv preprint arXiv:2407.21439 (2024) 3
-
[4]
arXiv preprint arXiv:2510.17800 (2025) 11
Cheng, J., Liu, Y., Zhang, X., Fei, Y., Hong, W., Lyu, R., Wang, W., Su, Z., Gu, X., Liu, X., et al.: Glyph: Scaling context windows via visual-text compression. arXiv preprint arXiv:2510.17800 (2025) 4, 9, 21
-
[5]
In: ICCV (2025) 9
Cheng, X., Zhang, W., Zhang, S., Yang, J., Guan, X., Wu, X., Li, X., Zhang, G., Liu, J., Mai, Y., et al.: Simplevqa: Multimodal factuality evaluation for multimodal large language models. In: ICCV (2025) 9
2025
-
[6]
Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory
Chhikara, P., Khant, D., Aryan, S., Singh, T., Yadav, D.: Mem0: Building production-ready ai agents with scalable long-term memory. arXiv preprint arXiv:2504.19413 (2025) 4
work page internal anchor Pith review arXiv 2025
-
[7]
Chng, Y.X., Hu, T., Tong, W., Li, X., Chen, J., Yu, H., Lu, J., Guo, H., Deng, H., Xie, C., Huang, G., Lin, D., Lu, L.: Sensenova-mars: Empowering multi- modal agentic reasoning and search via reinforcement learning. arXiv preprint arXiv:2512.24330 (2025) 11
-
[8]
Chu, Z., Wang, X., Hong, J., Fan, H., Huang, Y., Yang, Y., Xu, G., Hu, S., Kuang, D., Zhao, C., Xiang, C., Liu, M., Qin, B., Yu, X.: Redsearcher: A scalable and cost- efficient framework for long-horizon search agents. arXiv preprint arXiv:2602.14234 (2026),https://arxiv.org/pdf/2602.142343
-
[9]
PaddleOCR 3.0 Technical Report
Cui, C., Sun, T., Lin, M., Gao, T., Zhang, Y., Liu, J., Wang, X., Zhang, Z., Zhou, C., Liu, H., et al.: Paddleocr 3.0 technical report. arXiv preprint arXiv:2507.05595 (2025) 19
work page internal anchor Pith review arXiv 2025
-
[10]
Feng, L., Yang, F., Chen, F., Cheng, X., Xu, H., Wan, Z., Yan, M., An, B.: Agentocr: Reimagining agent history via optical self-compression. arXiv preprint arXiv:2601.04786 (2026) 4, 9
-
[11]
Fu, M., Peng, Y., Chen, D., Zhou, Z., Liu, B., Wan, Y., Zhao, Z., Yu, P.S., Krishna, R.: Seeking and updating with live visual knowledge. arXiv preprint arXiv:2504.05288 (2025) 6, 9
-
[12]
Geng, X., Xia, P., Zhang, Z., Wang, X., Wang, Q., Ding, R., Wang, C., Wu, J., Zhao, Y., Li, K., et al.: Webwatcher: Breaking new frontier of vision-language deep research agent. arXiv preprint arXiv:2508.05748 (2025) 3, 9, 11
-
[13]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Guo, D., Yang, D., Zhang, H., Song, J., Wang, P., Zhu, Q., Xu, R., Zhang, R., Ma, S., Bi, X., et al.: Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025) 7
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[14]
Guo, D., Wu, F., Zhu, F., Leng, F., Shi, G., Chen, H., Fan, H., Wang, J., Jiang, J., Wang, J., et al.: Seed1. 5-vl technical report. arXiv preprint arXiv:2505.07062 (2025) 1
work page internal anchor Pith review arXiv 2025
-
[15]
Deepeyesv2: Toward agentic multimodal model
Hong, J., Zhao, C., Zhu, C., Lu, W., Xu, G., Yu, X.: Deepeyesv2: Toward agentic multimodal model. arXiv preprint arXiv:2511.05271 (2025) 11
-
[16]
In: CVPR (2023) 3 16 F
Hu, Z., Iscen, A., Sun, C., Wang, Z., Chang, K.W., Sun, Y., Schmid, C., Ross, D.A., Fathi, A.: Reveal: Retrieval-augmented visual-language pre-training with multi-source multimodal knowledge memory. In: CVPR (2023) 3 16 F. Author et al
2023
-
[17]
Huang, W., Zeng, Y., Wang, Q., Fang, Z., Cao, S., Chu, Z., Yin, Q., Chen, S., Yin, Z., Chen, L., et al.: Vision-deepresearch: Incentivizing deepresearch capability in multimodal large language models. arXiv preprint arXiv:2601.22060 (2026) 2, 3, 10, 11
-
[18]
In: EMNLP (2025) 9
Jia, Y., Li, J., Yue, X., Li, B., Nie, P., Zou, K., Chen, W.: Visualwebinstruct: Scaling up multimodal instruction data through web search. In: EMNLP (2025) 9
2025
-
[19]
In: ICLR (2025) 9
Jiang, D., Zhang, R., Guo, Z., Wu, Y., Qiu, P., Lu, P., Chen, Z., Song, G., Gao, P., Liu, Y., et al.: Mmsearch: Unveiling the potential of large models as multi-modal search engines. In: ICLR (2025) 9
2025
-
[20]
In: EMNLP (2020) 3
Karpukhin, V., Oguz, B., Min, S., Lewis, P., Wu, L., Edunov, S., Chen, D., Yih, W.t.: Dense passage retrieval for open-domain question answering. In: EMNLP (2020) 3
2020
-
[21]
In: NIPS (2020) 3
Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.t., Rocktäschel, T., et al.: Retrieval-augmented generation for knowledge-intensive nlp tasks. In: NIPS (2020) 3
2020
-
[22]
LLaVA-OneVision: Easy Visual Task Transfer
Li, B., Zhang, Y., Guo, D., Zhang, R., Li, F., Zhang, H., Zhang, K., Zhang, P., Li, Y., Liu, Z., et al.: Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326 (2024) 5
work page Pith review arXiv 2024
-
[23]
DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models
Liu, A., Mei, A., Lin, B., Xue, B., Wang, B., Xu, B., Wu, B., Zhang, B., Lin, C., Dong, C., et al.: Deepseek-v3. 2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556 (2025) 2
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[24]
In: CVPR (2024) 10
Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning. In: CVPR (2024) 10
2024
-
[25]
In: ACL (2024) 2
Liu, N.F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., Liang, P.: Lost in the middle: How language models use long contexts. In: ACL (2024) 2
2024
-
[26]
In: CVPR (2025) 3
Liu, Y., Zhang, Y., Cai, J., Jiang, X., Hu, Y., Yao, J., Wang, Y., Xie, W.: Lamra: Large multimodal model as your advanced retrieval assistant. In: CVPR (2025) 3
2025
-
[27]
5: Building a vision-language model towards real world applications
Liu, Y., Tian, L., Zhou, X., Gao, X., Yu, K., Yu, Y., Zhou, J.: Points1. 5: Building a vision-language model towards real world applications. arXiv preprint arXiv:2412.08443 (2024) 1
-
[28]
In: EMNLP (2025) 1, 19
Liu, Y., Zhao, Z., Tian, L., Wang, H., Ye, X., You, Y., Yu, Z., Wu, C., Xiao, Z., Yu, Y., et al.: Points-reader: Distillation-free adaptation of vision-language models for document conversion. In: EMNLP (2025) 1, 19
2025
-
[29]
Visual agentic reinforcement fine-tuning
Liu,Z.,Zang,Y.,Zou,Y.,Liang,Z.,Dong,X.,Cao,Y.,Duan,H.,Lin,D.,Wang,J.: Visual agentic reinforcement fine-tuning. arXiv preprint arXiv:2505.14246 (2025) 1
-
[30]
In: ICCV (2023) 3
Mensink, T., Uijlings, J., Castrejon, L., Goel, A., Cadar, F., Zhou, H., Sha, F., Araujo, A., Ferrari, V.: Encyclopedic vqa: Visual questions about detailed proper- ties of fine-grained categories. In: ICCV (2023) 3
2023
-
[31]
OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation
Nan, K., Xie, R., Zhou, P., Fan, T., Yang, Z., Chen, Z., Li, X., Yang, J., Tai, Y.: Openvid-1m: A large-scale high-quality dataset for text-to-video generation. arXiv preprint arXiv:2407.02371 (2024) 19
work page internal anchor Pith review arXiv 2024
-
[32]
M., Shiee, N., Grasch, P., Jia, C., Yang, Y ., and Gan, Z
Narayan, K., Xu, Y., Cao, T., Nerella, K., Patel, V.M., Shiee, N., Grasch, P., Jia, C., Yang, Y., Gan, Z.: Deepmmsearch-r1: Empowering multimodal llms in multimodal web search. arXiv preprint arXiv:2510.12801 (2025) 1, 2, 4, 11
-
[33]
In: NIPS (2022) 19 Abbreviated paper title 17
Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., et al.: Laion-5b: An open large-scale dataset for training next generation image-text models. In: NIPS (2022) 19 Abbreviated paper title 17
2022
-
[34]
HybridFlow: A Flexible and Efficient RLHF Framework
Sheng, G., Zhang, C., Ye, Z., Wu, X., Zhang, W., Zhang, R., Peng, Y., Lin, H., Wu, C.: Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv: 2409.19256 (2024) 20
work page internal anchor Pith review arXiv 2024
-
[35]
arXiv preprint arXiv:2601.21468 , year=
Shi, Y., Liu, S., Yang, Y., Mao, W., Chen, Y., Gu, Q., Su, H., Cai, X., Wang, X., Zhang, A.: Memocr: Layout-aware visual memory for efficient long-horizon reasoning. arXiv preprint arXiv:2601.21468 (2026) 4, 9
-
[36]
arXiv preprint arXiv:2602.05829 (2026) 3
Shi, Y., Di, S., Chen, Q., Wang, Q., Cai, J., Jiang, X., Hu, Y., Xie, W.: Weaver: End-to-end agentic system training for video interleaved reasoning. arXiv preprint arXiv:2602.05829 (2026) 3
-
[37]
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., Catanzaro, B.: Megatron-lm: Training multi-billion parameter language models using model par- allelism. arXiv preprint arXiv:1909.08053 (2019) 20
work page internal anchor Pith review arXiv 1909
-
[38]
Neurocomputing (2024) 21
Su, J., Ahmed, M., Lu, Y., Pan, S., Bo, W., Liu, Y.: Roformer: Enhanced trans- former with rotary position embedding. Neurocomputing (2024) 21
2024
-
[39]
arXiv e-prints pp
Tan, H., Ji, Y., Hao, X., Lin, M., Wang, P., Wang, Z., Zhang, S.: Reason-rft: Re- inforcement fine-tuning for visual reasoning. arXiv e-prints pp. arXiv–2503 (2025) 9
2025
-
[40]
Tao, X., Teng, Y., Su, X., Fu, X., Wu, J., Tao, C., Liu, Z., Bai, H., Liu, R., Kong, L.: Mmsearch-plus: Benchmarking provenance-aware search for multimodal browsing agents. arXiv preprint arXiv:2508.21475 (2025) 9
-
[41]
Team, K., Du, A., Yin, B., Xing, B., Qu, B., Wang, B., Chen, C., Zhang, C., Du, C., Wei, C., et al.: Kimi-vl technical report. arXiv preprint arXiv:2504.07491 (2025) 1
work page internal anchor Pith review arXiv 2025
-
[42]
arXiv preprint arXiv:2411.10557 (2024) 5
Tu, J., Ni, Z., Crispino, N., Yu, Z., Bendersky, M., Gunel, B., Jia, R., Liu, X., Lyu, L., Song, D., et al.: Mlan: Language-based instruction tuning preserves and trans- fers knowledge in multimodal language models. arXiv preprint arXiv:2411.10557 (2024) 5
-
[43]
UncleCode: Crawl4ai: Open-source llm friendly web crawler & scraper.https: //github.com/unclecode/crawl4ai(2024) 10
2024
-
[44]
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., et al.: Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191 (2024) 10, 21
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[45]
arXiv preprint arXiv:2505.22019 (2025)
Wang, Q., Ding, R., Zeng, Y., Chen, Z., Chen, L., Wang, S., Xie, P., Huang, F., Zhao, F.: Vrag-rl: Empower vision-perception-based rag for visually rich infor- mation understanding via iterative reasoning with reinforcement learning. arXiv preprint arXiv:2505.22019 (2025) 4
-
[46]
URL https://openreview.net/forum?id=ehfRiF0R3a
Wang, Y., Takanobu, R., Liang, Z., Mao, Y., Hu, Y., McAuley, J., Wu, X.: Mem-{\alpha}: Learning memory construction via reinforcement learning. arXiv preprint arXiv:2509.25911 (2025) 4
-
[47]
DeepSeek-OCR: Contexts Optical Compression
Wei, H., Sun, Y., Li, Y.: Deepseek-ocr: Contexts optical compression. arXiv preprint arXiv:2510.18234 (2025) 4
work page internal anchor Pith review arXiv 2025
-
[48]
Wei, Y., Zhao, L., Sun, J., Lin, K., Yin, J., Hu, J., Zhang, Y., Yu, E., Lv, H., Weng, Z., et al.: Open vision reasoner: Transferring linguistic cognitive behavior for visual reasoning. arXiv preprint arXiv:2507.05255 (2025) 5, 20
-
[49]
Mmsearch-r1: Incentivizing lmms to search.arXiv preprint arXiv:2506.20670, 2025
Wu, J., Deng, Z., Li, W., Liu, Y., You, B., Li, B., Ma, Z., Liu, Z.: Mmsearch-r1: Incentivizing lmms to search. arXiv preprint arXiv:2506.20670 (2025) 1, 2, 4, 6, 9, 11
-
[50]
Open data synthesis for deep research.arXiv preprint arXiv:2509.00375, 2025
Xia, Z., Luo, K., Qian, H., Liu, Z.: Open data synthesis for deep research. arXiv preprint arXiv:2509.00375 (2025) 6, 20 18 F. Author et al
-
[51]
arXiv preprint arXiv:2510.01179 (2025) 9
Xu, Z., Soria, A.M., Tan, S., Roy, A., Agrawal, A.S., Poovendran, R., Panda, R.: Toucan: Synthesizing 1.5 m tool-agentic data from real-world mcp environments. arXiv preprint arXiv:2510.01179 (2025) 9
-
[52]
Yan, S., Yang, X., Huang, Z., Nie, E., Ding, Z., Li, Z., Ma, X., Bi, J., Kersting, K., Pan, J.Z., et al.: Memory-r1: Enhancing large language model agents to manage and utilize memories via reinforcement learning. arXiv preprint arXiv:2508.19828 (2025) 4
work page internal anchor Pith review arXiv 2025
-
[53]
In: Findings of EMNLP (2024) 3
Yan, Y., Xie, W.: Echosight: Advancing visual-language models with wiki knowl- edge. In: Findings of EMNLP (2024) 3
2024
-
[54]
Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al.: Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025) 10, 21
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[55]
Yao, H., Yin, Q., Yang, M., Zhao, Z., Wang, Y., Luo, H., Zhang, J., Huang, J.: Mm-deepresearch: A simple and effective multimodal agentic search baseline. arXiv preprint arXiv:2603.01050 (2026) 1, 11
-
[56]
In: ICLR (2022) 2, 5
Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K.R., Cao, Y.: React: Synergizing reasoning and acting in language models. In: ICLR (2022) 2, 5
2022
-
[57]
Ye, R., Zhang, Z., Li, K., Yin, H., Tao, Z., Zhao, Y., Su, L., Zhang, L., Qiao, Z., Wang, X., et al.: Agentfold: Long-horizon web agents with proactive context management. arXiv preprint arXiv:2510.24699 (2025) 4
-
[58]
arXiv preprint arXiv:2507.02259 , year=
Yu, H., Chen, T., Feng, J., Chen, J., Dai, W., Yu, Q., Zhang, Y.Q., Ma, W.Y., Liu, J., Wang, M., et al.: Memagent: Reshaping long-context llm with multi-conv rl-based memory agent. arXiv preprint arXiv:2507.02259 (2025) 2, 4
-
[59]
Visrag: Vision-based retrieval-augmented generation on multi-modality documents
Yu, S., Tang, C., Xu, B., Cui, J., Ran, J., Yan, Y., Liu, Z., Wang, S., Han, X., Liu, Z., et al.: Visrag: Vision-based retrieval-augmented generation on multi-modality documents. arXiv preprint arXiv:2410.10594 (2024) 3
-
[60]
Reconstruct Picture i with the highest possible fidelity
Zhang, Y., Ni, B., Chen, X.S., Zhang, H.R., Rao, Y., Peng, H., Lu, Q., Hu, H., Guo, M.H., Hu, S.M.: Bee: A high-quality corpus and full-stack suite to unlock advanced fully open mllms. arXiv preprint arXiv:2510.13795 (2025) 9
-
[61]
Zhang, Y., Hu, L., Sun, H., Wang, P., Wei, Y., Yin, S., Pei, J., Shen, W., Xia, P., Peng, Y., et al.: Skywork-r1v4: Toward agentic multimodal intelligence through in- terleaved thinking with images and deepresearch. arXiv preprint arXiv:2512.02395 (2025) 2, 11
-
[62]
Zhao, H., Wang, H., Peng, Y., Zhao, S., Tian, X., Chen, S., Ji, Y., Li, X.: 1.4 million open-source distilled reasoning dataset to empower large language model training. arXiv preprint arXiv:2503.19633 (2025) 9
-
[63]
In: NIPS (2024) 20
Zheng, L., Yin, L., Xie, Z., Sun, C., Huang, J., Yu, C.H., Cao, S., Kozyrakis, C., Stoica, I., Gonzalez, J.E., et al.: Sglang: Efficient execution of structured language model programs. In: NIPS (2024) 20
2024
-
[64]
In: EMNLP (2025) 3
Zheng, Y., Fu, D., Hu, X., Cai, X., Ye, L., Lu, P., Liu, P.: Deepresearcher: Scaling deep research via reinforcement learning in real-world environments. In: EMNLP (2025) 3
2025
-
[65]
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
Zhu, J., Wang, W., Chen, Z., Liu, Z., Ye, S., Gu, L., Tian, H., Duan, Y., Su, W., Shao, J., et al.: Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479 (2025) 19 Abbreviated paper title 19 POINTS-Seeker: Towards Training a Multimodal Agentic Search Model from Scratch Supplementary...
work page internal anchor Pith review arXiv 2025
-
[66]
In this tag, you need to conduct a step-by-step reasoning process about the image and question and evaluate whether tool use would be helpful and give the reason
In each turn, you should start with <think> tag. In this tag, you need to conduct a step-by-step reasoning process about the image and question and evaluate whether tool use would be helpful and give the reason. If received tool results, you also need to analyze them
-
[67]
**Remember: You can only make one single tool call per turn.**
If you think some tools are useful, call them in <tool_call> tag. **Remember: You can only make one single tool call per turn.**
-
[68]
type": "function
If you think no more tools are needed, you can answer in <answer> tag. You need to provide a concise summary of your reasoning process that leads to the final answer. Besides, you also need to put a simple and direct answer in \boxed{{}} for verification. The structure of your response should be like this: <think> thinking process satisfying the Instructi...
-
[69]
TARGET ENTITY != ANSWER: The entity you select to mask (the Target Entity) MUST NOT be the same as the Answer
-
[70]
VISUAL ANCHORING: Select an entity from the QUESTION that serves as a visual anchor (e.g., a location, a related object, or the subject’s environment)
-
[71]
MAINTAIN COMPLEXITY: Keep all the complex reasoning steps from the original question
-
[72]
this location
MASKING: Replace the selected Target Entity with a visual reference token (e.g., "this location", "this botanical garden", " the habitat shown here"). ### OUTPUT FORMAT (JSON ONLY): { "status": "success" | "discard", "selected_target_entity": "The entity chosen from the QUESTION ( must not be the answer)", "transformed_vqa_query": "The rewritten complex q...
-
[73]
**Semantic Equivalence**: Respond "yes" if the core meaning, facts, and conclusions are the same, even if wording, syntax, or language differs
-
[74]
**Detail Level**: If the Model Output is more detailed than the Ground Truth but maintains the same core fact, it is consistent (" yes")
-
[75]
**Fact Conflict**: Respond "no" if there is a contradiction, a wrong numerical value, or a different entity mentioned
-
[76]
- **Ground Truth**:
**Format Neutrality**: Ignore casing, punctuation, and minor spelling errors. ### Example 1 - **Question**: "What official residence was directed to be built by a colonial Governor of Dominica and whose gardens were damaged by hurricanes in 1916, 1930, and 1979? Note: The residence was built near the location shown in the image." - **Ground Truth**: "Gove...
1916
-
[77]
LeBron James to be Team USA’s male flag bearer for Olympic
The fourth result specifically states "LeBron James to be Team USA’s male flag bearer for Olympic" 3. The fifth result explicitly states "the basketball player wearing the number 6 jersey for the USA team is LeBron James" The image shows two players in USA uniforms - one wearing number 6 (LeBron James) and another wearing number 7 (Kevin Durant, as visibl...
-
[78]
The Cantor set is constructed by recursively removing the middle third from a line segment
-
[79]
binary fractal tree,
The Hausdorff dimension of the Cantor set is log2/log3, which equals approximately 0.6309... However, I need to verify if the image actually shows the Cantor set or a different fractal. The image search initially identified it as a "binary fractal tree," but the question specifically mentions "removing the middle third from a line segment." Let me search ...
-
[80]
The question might be misdescribing the figure - it could actually be a Cantor set or Cantor dust
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.