Sophiavl-r1: Reinforcing mllms reasoning with thinking reward

Fan, Kaixuan, Feng, Kaituo, Lyu, Haoming, Zhou, Dongzhan, Yue, Xiangyu , month = may, year = · 2025 · arXiv 2505.17018

5 Pith papers cite this work. Polarity classification is still indexing.

5 Pith papers citing it

representative citing papers

REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations

cs.CL · 2026-05-12 · unverdicted · novelty 8.0

REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reasoning models.

From Web to Pixels: Bringing Agentic Search into Visual Perception

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

WebEye benchmark and Pixel-Searcher agent enable visual perception tasks by using web search to resolve object identities before precise localization or answering.

Video-R1: Reinforcing Video Reasoning in MLLMs

cs.CV · 2025-03-27 · conditional · novelty 7.0

Video-R1 uses temporal-aware RL and mixed datasets to boost video reasoning in MLLMs, with a 7B model reaching 37.1% on VSI-Bench and surpassing GPT-4o.

PDCR: Perception-Decomposed Confidence Reward for Vision-Language Reasoning

cs.CL · 2026-05-13 · unverdicted · novelty 6.0

PDCR improves vision-language reasoning by computing separate normalized confidence advantages for perception steps and reasoning steps after unsupervised decomposition.

Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges

cs.LG · 2026-04-15 · unverdicted · novelty 5.0

The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under optimization pressure.

citing papers explorer

Showing 4 of 4 citing papers after filters.

REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations cs.CL · 2026-05-12 · unverdicted · none · ref 71
REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reasoning models.
From Web to Pixels: Bringing Agentic Search into Visual Perception cs.CV · 2026-05-12 · unverdicted · none · ref 6
WebEye benchmark and Pixel-Searcher agent enable visual perception tasks by using web search to resolve object identities before precise localization or answering.
PDCR: Perception-Decomposed Confidence Reward for Vision-Language Reasoning cs.CL · 2026-05-13 · unverdicted · none · ref 7
PDCR improves vision-language reasoning by computing separate normalized confidence advantages for perception steps and reasoning steps after unsupervised decomposition.
Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges cs.LG · 2026-04-15 · unverdicted · none · ref 172
The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under optimization pressure.

Sophiavl-r1: Reinforcing mllms reasoning with thinking reward

fields

years

verdicts

representative citing papers

citing papers explorer