WebEye benchmark and Pixel-Searcher agent enable visual perception tasks by using web search to resolve object identities before precise localization or answering.
hub
Sophiavl-r1: Reinforcing mllms reasoning with thinking reward
12 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
Gen-Searcher is the first trained search-augmented image generation agent using SFT followed by GRPO reinforcement learning with dual text-image rewards, delivering 15-16 point gains on knowledge-intensive benchmarks.
CamReasoner uses structured O-T-A reasoning and RL on 56k samples to lift camera movement classification from 73.8% to 78.4% and VQA from 60.9% to 74.5% on Qwen2.5-VL-7B.
Video-R1 uses temporal-aware RL and mixed datasets to boost video reasoning in MLLMs, with a 7B model reaching 37.1% on VSI-Bench and surpassing GPT-4o.
PDCR improves vision-language reasoning by computing separate normalized confidence advantages for perception steps and reasoning steps after unsupervised decomposition.
GraphThinker reduces temporal hallucinations in video reasoning by constructing event-based scene graphs and applying visual attention rewards in reinforcement finetuning.
AdaTooler-V trains MLLMs to adaptively use vision tools via AT-GRPO reinforcement learning and new datasets, reaching 89.8% on V* and outperforming GPT-4o.
The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under optimization pressure.
OneThinker unifies image and video reasoning in one model across 10 tasks via a 600k corpus, CoT-annotated SFT, and EMA-GRPO reinforcement learning, reporting strong results on 31 benchmarks plus some cross-task transfer.
FakeVLM-R1 combines GRPO reinforcement learning with critical-thinking CoT and a physics-annotated FakeClue++ dataset to reach claimed SOTA synthetic image detection while reducing over-rejection of real images.
citing papers explorer
-
From Web to Pixels: Bringing Agentic Search into Visual Perception
WebEye benchmark and Pixel-Searcher agent enable visual perception tasks by using web search to resolve object identities before precise localization or answering.
-
Gen-Searcher: Reinforcing Agentic Search for Image Generation
Gen-Searcher is the first trained search-augmented image generation agent using SFT followed by GRPO reinforcement learning with dual text-image rewards, delivering 15-16 point gains on knowledge-intensive benchmarks.
-
CamReasoner: Reinforcing Camera Movement Understanding via Structured Spatial Reasoning
CamReasoner uses structured O-T-A reasoning and RL on 56k samples to lift camera movement classification from 73.8% to 78.4% and VQA from 60.9% to 74.5% on Qwen2.5-VL-7B.
-
Video-R1: Reinforcing Video Reasoning in MLLMs
Video-R1 uses temporal-aware RL and mixed datasets to boost video reasoning in MLLMs, with a 7B model reaching 37.1% on VSI-Bench and surpassing GPT-4o.
-
PDCR: Perception-Decomposed Confidence Reward for Vision-Language Reasoning
PDCR improves vision-language reasoning by computing separate normalized confidence advantages for perception steps and reasoning steps after unsupervised decomposition.
-
GraphThinker: Reinforcing Temporally Grounded Video Reasoning with Event Graph Thinking
GraphThinker reduces temporal hallucinations in video reasoning by constructing event-based scene graphs and applying visual attention rewards in reinforcement finetuning.
-
AdaTooler-V: Adaptive Tool-Use for Images and Videos
AdaTooler-V trains MLLMs to adaptively use vision tools via AT-GRPO reinforcement learning and new datasets, reaching 89.8% on V* and outperforming GPT-4o.
-
Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges
The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under optimization pressure.
-
OneThinker: All-in-one Reasoning Model for Image and Video
OneThinker unifies image and video reasoning in one model across 10 tasks via a 600k corpus, CoT-annotated SFT, and EMA-GRPO reinforcement learning, reporting strong results on 31 benchmarks plus some cross-task transfer.
-
FakeVLM-R1: Internalizing Physical Laws via CoT for Synthetic Image Detection
FakeVLM-R1 combines GRPO reinforcement learning with critical-thinking CoT and a physics-annotated FakeClue++ dataset to reach claimed SOTA synthetic image detection while reducing over-rejection of real images.
- Are Tools Always Beneficial? Learning to Invoke Tools Adaptively for Dual-Mode Multimodal LLM Reasoning
- REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations