{"total":11,"items":[{"citing_arxiv_id":"2605.19852","ref_index":76,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Are Tools Always Beneficial? Learning to Invoke Tools Adaptively for Dual-Mode Multimodal LLM Reasoning","primary_cat":"cs.CL","submitted_at":"2026-05-19T13:44:26+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13467","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PDCR: Perception-Decomposed Confidence Reward for Vision-Language Reasoning","primary_cat":"cs.CL","submitted_at":"2026-05-13T12:55:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PDCR improves vision-language reasoning by computing separate normalized confidence advantages for perception steps and reasoning steps after unsupervised decomposition.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12813","ref_index":71,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations","primary_cat":"cs.CL","submitted_at":"2026-05-12T23:13:50+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12497","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Web to Pixels: Bringing Agentic Search into Visual Perception","primary_cat":"cs.CV","submitted_at":"2026-05-12T17:59:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"WebEye benchmark and Pixel-Searcher agent enable visual perception tasks by using web search to resolve object identities before precise localization or answering.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"1 58.33 24.68 37.97 40.00 39.20 51.95 40.22 GPT-5.4 44.44 28.57 35.44 60.00 48.80 53.25 49.38 GPT-4o 58.33 24.68 27.85 20.40 39.20 40.26 29.97 Gemini-3.1-Flash-Lite 55.56 23.38 43.04 40.40 40.80 49.35 40.68 Gemini-3.1-Pro 83.33 48.05 44.30 68.00 62.40 79.22 63.82 Open-Source QA Models UniVG-R1 [37] 30.56 19.74 31.65 22.13 36.00 22.08 26.22 SophiaVL-R1 [6] 38.89 23.68 22.78 27.46 34.40 37.66 29.67 VL-Rethinker [45] 33.33 23.68 31.65 27.05 33.60 29.87 29.20 Vision-R1 [46] 13.89 22.37 17.72 27.87 12.80 19.48 21.19 Open-Source General Models OneThinker-8B [39] 36.11 21.05 29.11 24.59 40.00 23.38 28.26 InternVL-3.5-8B [40] 25.0029.8732.91 42.00 36.00 31.17 36.02 Qwen3-VL-8B [36]47.2224.68 31.65 35.6043."},{"citing_arxiv_id":"2604.13602","ref_index":172,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges","primary_cat":"cs.LG","submitted_at":"2026-04-15T08:11:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under optimization pressure.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"High-level issues involve physical implausibility, such as asymmetrical facial 24 Reward Hacking in the Era of Large Models Fudan NLP Group features or illogical object duplication [ 181, 211, 212]. In 3D generation, the \"Janus problem\"-where an object displays multiple front faces-is a classic instance ofevaluator-level exploitation: the generator reverse-engineers the blind spots of a static 2D-based proxy [172]. (2) Mode Collapse: Optimization amplification can drive the output distribution into a narrow set of high-reward patterns, sacrificing diversity [171, 190, 192, 213]. This can result in blank images [173], blurry video frames [ 214], or motionless clips [ 215]. (3) Capability Trade-offs: Models may achieve high scores on specific metrics while degrading overall composition or prompt alignment [173, 212]."},{"citing_arxiv_id":"2603.28767","ref_index":30,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Gen-Searcher: Reinforcing Agentic Search for Image Generation","primary_cat":"cs.CV","submitted_at":"2026-03-30T17:59:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Gen-Searcher is the first trained search-augmented image generation agent using SFT followed by GRPO reinforcement learning with dual text-image rewards, delivering 15-16 point gains on knowledge-intensive benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.17555","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"GraphThinker: Reinforcing Temporally Grounded Video Reasoning with Event Graph Thinking","primary_cat":"cs.CV","submitted_at":"2026-02-19T17:09:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GraphThinker reduces temporal hallucinations in video reasoning by constructing event-based scene graphs and applying visual attention rewards in reinforcement finetuning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.00181","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CamReasoner: Reinforcing Camera Movement Understanding via Structured Spatial Reasoning","primary_cat":"cs.CV","submitted_at":"2026-01-30T04:45:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CamReasoner uses structured O-T-A reasoning and RL on 56k samples to lift camera movement classification from 73.8% to 78.4% and VQA from 60.9% to 74.5% on Qwen2.5-VL-7B.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.16918","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AdaTooler-V: Adaptive Tool-Use for Images and Videos","primary_cat":"cs.CV","submitted_at":"2025-12-18T18:59:55+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AdaTooler-V trains MLLMs to adaptively use vision tools via AT-GRPO reinforcement learning and new datasets, reaching 89.8% on V* and outperforming GPT-4o.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.03043","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OneThinker: All-in-one Reasoning Model for Image and Video","primary_cat":"cs.CV","submitted_at":"2025-12-02T18:59:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"OneThinker unifies image and video reasoning in one model across 10 tasks via a 600k corpus, CoT-annotated SFT, and EMA-GRPO reinforcement learning, reporting strong results on 31 benchmarks plus some cross-task transfer.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.21776","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Video-R1: Reinforcing Video Reasoning in MLLMs","primary_cat":"cs.CV","submitted_at":"2025-03-27T17:59:51+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Video-R1 uses temporal-aware RL and mixed datasets to boost video reasoning in MLLMs, with a 7B model reaching 37.1% on VSI-Bench and surpassing GPT-4o.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms.arXiv preprint arXiv:2406.07476, 2024. [5] Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V Le, Sergey Levine, and Yi Ma. Sft memorizes, rl generalizes: A comparative study of foundation model post-training.arXiv preprint arXiv:2501.17161, 2025. [6] Kaixuan Fan, Kaituo Feng, Haoming Lyu, Dongzhan Zhou, and Xiangyu Yue. Sophiavl-r1: Reinforcing mllms reasoning with thinking reward.arXiv preprint arXiv:2505.17018, 2025. [7] Jiazhan Feng, Shijue Huang, Xingwei Qu, Ge Zhang, Yujia Qin, Baoquan Zhong, Chengquan Jiang, Jinxin Chi, and Wanjun Zhong. Retool: Reinforcement learning for strategic tool use in"}],"limit":50,"offset":0}