{"paper":{"title":"Visual-RFT: Visual Reinforcement Fine-Tuning","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Visual-RFT lets large vision-language models learn visual tasks from perceptual rewards instead of labeled data.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Dahua Lin, Haodong Duan, Jiaqi Wang, Xiaoyi Dong, Yuhang Cao, Yuhang Zang, Zeyi Sun, Ziyu Liu","submitted_at":"2025-03-03T18:16:32Z","abstract_excerpt":"Reinforcement Fine-Tuning (RFT) in Large Reasoning Models like OpenAI o1 learns from feedback on its answers, which is especially useful in applications when fine-tuning data is scarce. Recent open-source work like DeepSeek-R1 demonstrates that reinforcement learning with verifiable reward is one key direction in reproducing o1. While the R1-style model has demonstrated success in language models, its application in multi-modal domains remains under-explored. This work introduces Visual Reinforcement Fine-Tuning (Visual-RFT), which further extends the application areas of RFT on visual tasks. 
"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Visual-RFT improves accuracy by 24.3% over the baseline in one-shot fine-grained image classification with around 100 samples and exceeds the baseline by 21.9 on COCO's two-shot setting.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the visual perception verifiable reward functions (e.g., IoU) provide sufficiently dense and unbiased signals to guide policy optimization without introducing new failure modes not present in language-only RFT.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Visual-RFT applies reinforcement learning with verifiable perception rewards to improve large vision-language models on fine-grained classification, few-shot detection, and grounding tasks.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Visual-RFT lets large vision-language models learn visual tasks from perceptual rewards instead of labeled data.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"8076afea17fa7125bd56efb0afa36d68826a481619bf10d58d86d523ea1c3055"},"source":{"id":"2503.01785","kind":"arxiv","version":1},"verdict":{"id":"1829aa0f-ed0e-4ff8-8bd7-dcac36f656d3","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-13T22:11:14.863124Z","strongest_claim":"Visual-RFT improves accuracy by 24.3% over the baseline in one-shot fine-grained image classification with around 100 samples and exceeds the baseline by 21.9 on COCO's two-shot setting.","one_line_summary":"Visual-RFT applies reinforcement learning with verifiable perception rewards to improve large vision-language models on fine-grained classification, few-shot detection, and 
grounding tasks.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the visual perception verifiable reward functions (e.g., IoU) provide sufficiently dense and unbiased signals to guide policy optimization without introducing new failure modes not present in language-only RFT.","pith_extraction_headline":"Visual-RFT lets large vision-language models learn visual tasks from perceptual rewards instead of labeled data."},"references":{"count":52,"sample":[{"doi":"","year":null,"title":"Lmrl gym: Benchmarks for multi-turn reinforcement learning with language models","work_id":"bad1baad-5c3f-4456-ac96-29c8f5e78bfb","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"InternLM2 Technical Report","work_id":"dfa13e0e-1c3c-4fb6-943d-a19945bacdbe","ref_index":2,"cited_arxiv_id":"2403.17297","is_internal_anchor":true},{"doi":"","year":2023,"title":"Grounding large language models in interactive environments with online reinforcement learning","work_id":"3086d7ff-f5e3-4593-9700-3603bad5be12","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","ref_index":4,"cited_arxiv_id":"2501.12948","is_internal_anchor":true},{"doi":"","year":2019,"title":"Lvis: A dataset for large vocabulary instance segmentation","work_id":"d2430c96-329f-4510-aaa0-74f084edb36d","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":52,"snapshot_sha256":"093104e5a7a3d56c18936c23fdb27b5fd0a3cb53533ee3d4794f03e915dcfa41","internal_anchors":19},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}