hub

Visual-RFT: Visual Reinforcement Fine-Tuning

Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan · 2025 · cs.CV · arXiv 2503.01785

22 Pith papers cite this work. Polarity classification is still indexing.

22 Pith papers citing it

open full Pith review browse 22 citing papers arXiv PDF

abstract

Reinforcement Fine-Tuning (RFT) in Large Reasoning Models like OpenAI o1 learns from feedback on its answers, which is especially useful in applications when fine-tuning data is scarce. Recent open-source work like DeepSeek-R1 demonstrates that reinforcement learning with verifiable reward is one key direction in reproducing o1. While the R1-style model has demonstrated success in language models, its application in multi-modal domains remains under-explored. This work introduces Visual Reinforcement Fine-Tuning (Visual-RFT), which further extends the application areas of RFT on visual tasks. Specifically, Visual-RFT first uses Large Vision-Language Models (LVLMs) to generate multiple responses containing reasoning tokens and final answers for each input, and then uses our proposed visual perception verifiable reward functions to update the model via the policy optimization algorithm such as Group Relative Policy Optimization (GRPO). We design different verifiable reward functions for different perception tasks, such as the Intersection over Union (IoU) reward for object detection. Experimental results on fine-grained image classification, few-shot object detection, reasoning grounding, as well as open-vocabulary object detection benchmarks show the competitive performance and advanced generalization ability of Visual-RFT compared with Supervised Fine-tuning (SFT). For example, Visual-RFT improves accuracy by $24.3\%$ over the baseline in one-shot fine-grained image classification with around 100 samples. In few-shot object detection, Visual-RFT also exceeds the baseline by $21.9$ on COCO's two-shot setting and $15.4$ on LVIS. Our Visual-RFT represents a paradigm shift in fine-tuning LVLMs, offering a data-efficient, reward-driven approach that enhances reasoning and adaptability for domain-specific tasks.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Pest-Thinker: Learning to Think and Reason like Entomologists via Reinforcement Learning

cs.CV · 2026-05-07 · unverdicted · novelty 7.0

Pest-Thinker is a reinforcement learning framework that improves MLLMs' expert-level reasoning on pest morphology via synthesized CoT trajectories, GRPO optimization, and an LLM-judged feature reward on new benchmarks QFSD and AgriInsect.

CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding

cs.CV · 2026-04-24 · unverdicted · novelty 7.0

CGC improves fine-grained multi-image understanding in MLLMs by constructing contrastive training instances from existing single-image annotations and adding a rule-based spatial reward, achieving SOTA on MIG-Bench and VLM2-Bench with transfer gains to other multimodal tasks.

Multi-modal Reasoning with LLMs for Visual Semantic Arithmetic

cs.AI · 2026-04-21 · unverdicted · novelty 7.0

SAri-RFT applies GRPO-based reinforcement fine-tuning to LVLMs on novel two-term and three-term visual semantic arithmetic tasks, reaching SOTA on the new IRPD dataset and Visual7W-Telling.

UIPress: Bringing Optical Token Compression to UI-to-Code Generation

cs.CL · 2026-04-10 · unverdicted · novelty 7.0

UIPress is the first encoder-side learned optical compression method for UI-to-Code that compresses visual tokens to 256, outperforming the uncompressed baseline by 7.5% CLIP score and the best inference-time baseline by 4.6% while delivering 9.1x TTFT speedup.

Understanding the Role of Hallucination in Reinforcement Post-Training of Multimodal Reasoning Models

cs.LG · 2026-04-03 · unverdicted · novelty 7.0

RL post-training on hallucination-forced multimodal data improves reasoning performance and can outperform standard training.

ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.

RadThinking: A Dataset for Longitudinal Clinical Reasoning in Radiology

cs.CV · 2026-05-11 · unverdicted · novelty 6.0

RadThinking releases a large longitudinal CT VQA dataset stratified into foundation perception questions, single-rule reasoning questions, and compositional multi-step chains grounded in clinical reporting standards for cancer screening.

Injecting Distributional Awareness into MLLMs via Reinforcement Learning for Deep Imbalanced Regression

cs.CL · 2026-05-02 · unverdicted · novelty 6.0 · 2 refs

A plug-and-play RL method adds batch-level distributional supervision via CCC rewards to reduce regression-to-the-mean in MLLMs on imbalanced regression benchmarks.

Rethinking Reinforcement Fine-Tuning in LVLM: Convergence, Reward Decomposition, and Generalization

cs.LG · 2026-04-21 · unverdicted · novelty 6.0

Introduces TA-MDP and proves GRPO convergence at O(1/sqrt(T)), a reward decomposition bound, and PAC-Bayes generalization for tool-augmented LVLM policies.

CLEAR: Unlocking Generative Potential for Degraded Image Understanding in Unified Multimodal Models

cs.CV · 2026-04-06 · unverdicted · novelty 6.0

CLEAR uses degradation-aware fine-tuning, a latent representation bridge, and interleaved reinforcement learning to connect generative and reasoning capabilities in multimodal models for better degraded image understanding.

Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning

cs.CV · 2026-04-06 · unverdicted · novelty 6.0

RLER trains video-reasoning models with three task-driven RL rewards for evidence production and elects the best answer from a few candidates via evidence consistency scoring, yielding 6.3% average gains on eight benchmarks.

EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs

cs.CV · 2026-04-01 · unverdicted · novelty 6.0

EgoMind activates spatial cognition in MLLMs via linguistic Role-Play Caption and Progressive Spatial Analysis, reaching competitive results on VSI-Bench, SPAR-Bench, SITE-Bench and SPBench with only 5K SFT and 20K RL samples.

VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

cs.CV · 2025-04-10 · unverdicted · novelty 6.0

VLM-R1 applies R1-style RL using rule-based rewards on visual tasks with clear ground truth to achieve competitive performance and superior generalization over SFT in vision-language models.

Perceptual Flow Network for Visually Grounded Reasoning

cs.CV · 2026-05-04 · unverdicted · novelty 5.0

PFlowNet decouples perception from reasoning, integrates multi-dimensional rewards with vicinal geometric shaping via variational RL, and reports new SOTA results on V* Bench (90.6%) and MME-RealWorld-lite (67.0%).

Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding

cs.LG · 2026-04-23 · unverdicted · novelty 5.0

A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.

HalluClear: Diagnosing, Evaluating and Mitigating Hallucinations in GUI Agents

cs.AI · 2026-04-19 · unverdicted · novelty 5.0

HalluClear supplies a taxonomy, calibrated evaluation, and lightweight post-training mitigation that reduces hallucinations in GUI agents using only 9K samples.

Towards Robust Endogenous Reasoning: Unifying Drift Adaptation in Non-Stationary Tuning

cs.LG · 2026-04-17 · unverdicted · novelty 5.0

CPO++ adapts reinforcement fine-tuning of MLLMs to endogenous multi-modal concept drift through counterfactual reasoning and preference optimization, yielding better coherence and cross-domain robustness in safety-critical settings.

SignReasoner: Compositional Reasoning for Complex Traffic Sign Understanding via Functional Structure Units

cs.CV · 2026-04-12 · unverdicted · novelty 5.0

SignReasoner decomposes traffic signs into functional structure units and uses a two-stage VLM post-training pipeline to achieve state-of-the-art compositional reasoning on a new benchmark.

SVSR: A Self-Verification and Self-Rectification Paradigm for Multimodal Reasoning

cs.AI · 2026-04-11 · unverdicted · novelty 5.0

SVSR trains multimodal models to verify and correct their own reasoning using a preference dataset, supervised fine-tuning, and semi-online DPO with a teacher model.

Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models

cs.CV · 2026-04-09 · unverdicted · novelty 5.0

HDPO reframes tool efficiency as a conditional objective within accurate trajectories, enabling Metis to reduce tool invocations by orders of magnitude while raising reasoning accuracy.

RCoT-Seg: Reinforced Chain-of-Thought for Video Reasoning and Segmentation

cs.CV · 2026-05-08 · unverdicted · novelty 4.0

RCoT-Seg uses GRPO-reinforced keyframe selection from a CoT-start corpus followed by SAM2 mask propagation to improve video object segmentation under implicit temporal instructions over prior MLLM sampling methods.

From System 1 to System 2: A Survey of Reasoning Large Language Models

cs.AI · 2025-02-24 · accept · novelty 3.0

The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.

citing papers explorer

Showing 22 of 22 citing papers.

Pest-Thinker: Learning to Think and Reason like Entomologists via Reinforcement Learning cs.CV · 2026-05-07 · unverdicted · none · ref 16 · internal anchor
Pest-Thinker is a reinforcement learning framework that improves MLLMs' expert-level reasoning on pest morphology via synthesized CoT trajectories, GRPO optimization, and an LLM-judged feature reward on new benchmarks QFSD and AgriInsect.
CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding cs.CV · 2026-04-24 · unverdicted · none · ref 30 · internal anchor
CGC improves fine-grained multi-image understanding in MLLMs by constructing contrastive training instances from existing single-image annotations and adding a rule-based spatial reward, achieving SOTA on MIG-Bench and VLM2-Bench with transfer gains to other multimodal tasks.
Multi-modal Reasoning with LLMs for Visual Semantic Arithmetic cs.AI · 2026-04-21 · unverdicted · none · ref 13 · internal anchor
SAri-RFT applies GRPO-based reinforcement fine-tuning to LVLMs on novel two-term and three-term visual semantic arithmetic tasks, reaching SOTA on the new IRPD dataset and Visual7W-Telling.
UIPress: Bringing Optical Token Compression to UI-to-Code Generation cs.CL · 2026-04-10 · unverdicted · none · ref 28 · internal anchor
UIPress is the first encoder-side learned optical compression method for UI-to-Code that compresses visual tokens to 256, outperforming the uncompressed baseline by 7.5% CLIP score and the best inference-time baseline by 4.6% while delivering 9.1x TTFT speedup.
Understanding the Role of Hallucination in Reinforcement Post-Training of Multimodal Reasoning Models cs.LG · 2026-04-03 · unverdicted · none · ref 20 · internal anchor
RL post-training on hallucination-forced multimodal data improves reasoning performance and can outperform standard training.
ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents cs.AI · 2026-05-12 · unverdicted · none · ref 19 · internal anchor
ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.
RadThinking: A Dataset for Longitudinal Clinical Reasoning in Radiology cs.CV · 2026-05-11 · unverdicted · none · ref 77 · internal anchor
RadThinking releases a large longitudinal CT VQA dataset stratified into foundation perception questions, single-rule reasoning questions, and compositional multi-step chains grounded in clinical reporting standards for cancer screening.
Injecting Distributional Awareness into MLLMs via Reinforcement Learning for Deep Imbalanced Regression cs.CL · 2026-05-02 · unverdicted · none · ref 97 · 2 links · internal anchor
A plug-and-play RL method adds batch-level distributional supervision via CCC rewards to reduce regression-to-the-mean in MLLMs on imbalanced regression benchmarks.
Rethinking Reinforcement Fine-Tuning in LVLM: Convergence, Reward Decomposition, and Generalization cs.LG · 2026-04-21 · unverdicted · none · ref 2 · internal anchor
Introduces TA-MDP and proves GRPO convergence at O(1/sqrt(T)), a reward decomposition bound, and PAC-Bayes generalization for tool-augmented LVLM policies.
CLEAR: Unlocking Generative Potential for Degraded Image Understanding in Unified Multimodal Models cs.CV · 2026-04-06 · unverdicted · none · ref 7 · internal anchor
CLEAR uses degradation-aware fine-tuning, a latent representation bridge, and interleaved reinforcement learning to connect generative and reasoning capabilities in multimodal models for better degraded image understanding.
Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning cs.CV · 2026-04-06 · unverdicted · none · ref 29 · internal anchor
RLER trains video-reasoning models with three task-driven RL rewards for evidence production and elects the best answer from a few candidates via evidence consistency scoring, yielding 6.3% average gains on eight benchmarks.
EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs cs.CV · 2026-04-01 · unverdicted · none · ref 28 · internal anchor
EgoMind activates spatial cognition in MLLMs via linguistic Role-Play Caption and Progressive Spatial Analysis, reaching competitive results on VSI-Bench, SPAR-Bench, SITE-Bench and SPBench with only 5K SFT and 20K RL samples.
VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model cs.CV · 2025-04-10 · unverdicted · none · ref 35 · internal anchor
VLM-R1 applies R1-style RL using rule-based rewards on visual tasks with clear ground truth to achieve competitive performance and superior generalization over SFT in vision-language models.
Perceptual Flow Network for Visually Grounded Reasoning cs.CV · 2026-05-04 · unverdicted · none · ref 31 · internal anchor
PFlowNet decouples perception from reasoning, integrates multi-dimensional rewards with vicinal geometric shaping via variational RL, and reports new SOTA results on V* Bench (90.6%) and MME-RealWorld-lite (67.0%).
Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding cs.LG · 2026-04-23 · unverdicted · none · ref 32 · internal anchor
A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.
HalluClear: Diagnosing, Evaluating and Mitigating Hallucinations in GUI Agents cs.AI · 2026-04-19 · unverdicted · none · ref 32 · internal anchor
HalluClear supplies a taxonomy, calibrated evaluation, and lightweight post-training mitigation that reduces hallucinations in GUI agents using only 9K samples.
Towards Robust Endogenous Reasoning: Unifying Drift Adaptation in Non-Stationary Tuning cs.LG · 2026-04-17 · unverdicted · none · ref 2 · internal anchor
CPO++ adapts reinforcement fine-tuning of MLLMs to endogenous multi-modal concept drift through counterfactual reasoning and preference optimization, yielding better coherence and cross-domain robustness in safety-critical settings.
SignReasoner: Compositional Reasoning for Complex Traffic Sign Understanding via Functional Structure Units cs.CV · 2026-04-12 · unverdicted · none · ref 18 · internal anchor
SignReasoner decomposes traffic signs into functional structure units and uses a two-stage VLM post-training pipeline to achieve state-of-the-art compositional reasoning on a new benchmark.
SVSR: A Self-Verification and Self-Rectification Paradigm for Multimodal Reasoning cs.AI · 2026-04-11 · unverdicted · none · ref 13 · internal anchor
SVSR trains multimodal models to verify and correct their own reasoning using a preference dataset, supervised fine-tuning, and semi-online DPO with a teacher model.
Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models cs.CV · 2026-04-09 · unverdicted · none · ref 18 · internal anchor
HDPO reframes tool efficiency as a conditional objective within accurate trajectories, enabling Metis to reduce tool invocations by orders of magnitude while raising reasoning accuracy.
RCoT-Seg: Reinforced Chain-of-Thought for Video Reasoning and Segmentation cs.CV · 2026-05-08 · unverdicted · none · ref 16 · internal anchor
RCoT-Seg uses GRPO-reinforced keyframe selection from a CoT-start corpus followed by SAM2 mask propagation to improve video object segmentation under implicit temporal instructions over prior MLLM sampling methods.
From System 1 to System 2: A Survey of Reasoning Large Language Models cs.AI · 2025-02-24 · accept · none · ref 295 · internal anchor
The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.

Visual-RFT: Visual Reinforcement Fine-Tuning

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer