{"total":14,"items":[{"citing_arxiv_id":"2606.31504","ref_index":79,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SimpleSearch-VL: A Simple Recipe for Multimodal Agentic Deep Search","primary_cat":"cs.CV","submitted_at":"2026-06-30T11:22:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"SimpleSearch-VL improves Qwen3-VL multimodal agent baselines by 15.8-16 points on average using 7K total training examples and reaches parity with Gemini-3-Pro on the 30B variant.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.05784","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TAPO: Tool-Aware Policy Optimization via Credit Transfer for Multimodal Search Agents","primary_cat":"cs.AI","submitted_at":"2026-06-04T07:15:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TAPO corrects credit misassignment in RL for multimodal search agents by using tool parameter similarity to share advantages across equivalent actions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22177","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Maestro: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles","primary_cat":"cs.LG","submitted_at":"2026-05-21T08:47:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Maestro uses outcome-based RL to train a lightweight policy that orchestrates ensembles of frozen expert models and skills, reporting 70.1% average accuracy across ten multimodal benchmarks and outperforming GPT-5 and Gemini-2.5-Pro while generalizing to unseen 
components.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15792","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Reversing the Flow: Generation-to-Understanding Synergy in Large Multimodal Models","primary_cat":"cs.CV","submitted_at":"2026-05-15T09:48:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Generation-to-Understanding synergy lets multimodal models create self-generated visual edits as intermediate steps, improving performance on twelve benchmarks while revealing limits in task-aligned self-reflection.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09860","ref_index":34,"ref_count":3,"confidence":0.9,"is_internal_anchor":false,"paper_title":"When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning","primary_cat":"cs.AI","submitted_at":"2026-05-11T01:43:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Learns state-conditioned commitment depth in a 7B vision-language policy that jointly predicts actions and replan intervals, outperforming fixed-depth baselines and larger models on Sliding Puzzle and Sokoban while providing a theoretical dominance result.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Kim Huat Chua, and Prahlad Vadakkepat. Adaptive action chunking at inference-time for vision-language-action models. arXiv preprint arXiv:2604.04161, 2026. [33] Ziyu Liu, Yuhang Zang, Yushan Zou, Zijian Liang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual agentic reinforcement fine-tuning, 2025. URL https://arxiv.org/abs/2505.14246. 
[34] David Q Mayne, James B Rawlings, Christopher V Rao, and Pierre OM Scokaert. Constrained model predictive control: Stability and optimality. Automatica, 36(6):789-814, 2000. [35] OpenAI. Thinking with images, 2025. URL https://openai.com/index/thinking-with-images/. [36] Stéphane Ross and Drew Bagnell. Efficient reductions for imitation learning. In Proceedings of"},{"citing_arxiv_id":"2605.02730","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Perceptual Flow Network for Visually Grounded Reasoning","primary_cat":"cs.CV","submitted_at":"2026-05-04T15:31:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PFlowNet decouples perception from reasoning, integrates multi-dimensional rewards with vicinal geometric shaping via variational RL, and reports new SOTA results on V* Bench (90.6%) and MME-RealWorld-lite (67.0%).","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[30] Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? In European conference on computer vision, pages 216-233. Springer, 2024. [31] Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-rft: Visual reinforcement fine-tuning. arXiv preprint arXiv:2503.01785, 2025. [32] Ziyu Liu, Yuhang Zang, Yushan Zou, Zijian Liang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual agentic reinforcement fine-tuning. URL https://arxiv.org/abs/2505.14246, 2025. [33] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017. 
[34] Run Luo, Lu Wang, Wanwei He, Longze Chen, Jiaming Li, and Xiaobo Xia."},{"citing_arxiv_id":"2604.21409","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images","primary_cat":"cs.CV","submitted_at":"2026-04-23T08:23:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"S1-VL combines structured scientific reasoning with iterative image manipulation via code execution to reach state-of-the-art results on visual and scientific reasoning benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19857","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Rethinking Reinforcement Fine-Tuning in LVLM: Convergence, Reward Decomposition, and Generalization","primary_cat":"cs.LG","submitted_at":"2026-04-21T17:21:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces TA-MDP and proves GRPO convergence at O(1/sqrt(T)), a reward decomposition bound, and PAC-Bayes generalization for tool-augmented LVLM policies.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.14520","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Chain of Modality: From Static Fusion to Dynamic Orchestration in Omni-MLLMs","primary_cat":"cs.CV","submitted_at":"2026-04-16T01:21:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Chain of Modality dynamically orchestrates multimodal input topologies and bifurcates cognitive execution to overcome static fusion biases in 
Omni-MLLMs.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"For analytical tasks requiring transitive logic, we introduce a data-efficient SFT strategy on Music-AVQA. Extensive experiments demonstrate that our approach consistently matches or outperforms specialized models on fine-grained music perception benchmarks (Music-AVQA [15]), open-domain scenarios (AV-Odyssey [7], DailyOmni [39], OmniBench [18], WorldSense [10], AV-Counting [23]), and cross-modal hallucinations (AVH-Bench [24]). Our contributions are summarized as follows: (1) We systematically diagnose the perceptual degradation in Omni-MLLMs, revealing that static topologies induce two severe artifacts: positional bias in sequential inputs and alignment traps in interleaved inputs. (2) We propose Chain of Modality (CoM), which formulates mul-"},{"citing_arxiv_id":"2604.14029","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"POINTS-Seeker: Towards Training a Multimodal Agentic Search Model from Scratch","primary_cat":"cs.CV","submitted_at":"2026-04-15T16:09:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"POINTS-Seeker-8B is an 8B multimodal model trained from scratch for agentic search that uses seeding and visual-space history folding to outperform prior models on six visual reasoning benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"ciency in static perception and reasoning. However, a critical epistemological gap remains: their internal knowledge is frozen upon training, rendering them prone to hallucinations when addressing real-time queries. 
To bridge this gap, the community is shifting towards agentic search, empowering LMMs to autonomously perceive, query, and synthesize information from the web environment [29,32,49,55] to solve complex visual question answering (VQA) tasks. ⋆ These authors contributed equally to this work. arXiv:2604.14029v1 [cs.CV] 15 Apr 2026 [Figure: Multi-turn Agentic Search. Turn 1: Identify the group (Action 1: Image Search; Obs 1: boygenius). Turn 2: Gather evidence (Action 2: Web Search; Obs 2: Every Which Way). Turn 3: Verify the answer (Action 3: Visit Webpage; Obs 3: Answer verified).]"},{"citing_arxiv_id":"2604.09508","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VISOR: Agentic Visual Retrieval-Augmented Generation via Iterative Search and Over-horizon Reasoning","primary_cat":"cs.CV","submitted_at":"2026-04-10T17:25:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VISOR is a unified agentic VRAG framework with Evidence Space structuring, visual action evaluation/correction, and dynamic sliding-window trajectories trained via GRPO-based RL that achieves SOTA performance on long-horizon visual reasoning benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.02794","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CharTool: Tool-Integrated Visual Reasoning for Chart Understanding","primary_cat":"cs.AI","submitted_at":"2026-04-03T07:02:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CharTool equips MLLMs with cropping and code tools plus agentic RL on DuoChart data to raise chart-reasoning accuracy by up to 9.78 percent on 
benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.16300","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Code-in-the-Loop Forensics: Agentic Tool Use for Image Forgery Detection","primary_cat":"cs.AI","submitted_at":"2025-12-18T08:38:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ForenAgent lets MLLMs create and iteratively improve low-level Python tools for image forgery detection via a two-stage training pipeline and a new 100k-image benchmark dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.05271","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DeepEyesV2: Toward Agentic Multimodal Model","primary_cat":"cs.CV","submitted_at":"2025-11-07T14:31:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DeepEyesV2 uses a two-stage cold-start plus reinforcement learning pipeline to produce an agentic multimodal model that adaptively invokes tools and outperforms direct RL on real-world reasoning benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}