pith. machine review for the scientific record.

arxiv: 2503.01785 · v1 · submitted 2025-03-03 · 💻 cs.CV

Recognition: unknown

Visual-RFT: Visual Reinforcement Fine-Tuning

Authors on Pith · no claims yet

Pith reviewed 2026-05-13 22:11 UTC · model grok-4.3

classification 💻 cs.CV
keywords visual reinforcement fine-tuning · large vision-language models · few-shot learning · object detection · image classification · policy optimization · reinforcement learning

The pith

Visual-RFT lets large vision-language models learn visual tasks from verifiable perceptual rewards instead of large labeled datasets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Visual-RFT to extend reinforcement fine-tuning from language models into visual domains for large vision-language models. It generates multiple reasoned responses per input image and refines the model using reward functions based on visual metrics such as Intersection over Union for detection. This yields substantial gains over supervised fine-tuning in low-data settings, including a 24.3 percent accuracy lift in one-shot fine-grained classification with roughly 100 samples and a 21.9-point gain on COCO's two-shot object detection setting (15.4 on LVIS). A sympathetic reader would care because the method reduces reliance on large annotated datasets for adapting multimodal models to specific visual tasks.

Core claim

Visual-RFT first uses Large Vision-Language Models to generate multiple responses containing reasoning tokens and final answers for each input, and then uses the proposed visual perception verifiable reward functions to update the model via a policy optimization algorithm such as Group Relative Policy Optimization (GRPO). Different verifiable reward functions are designed for different perception tasks, such as the Intersection over Union reward for object detection, producing competitive performance and advanced generalization on fine-grained image classification, few-shot object detection, reasoning grounding, and open-vocabulary object detection benchmarks.
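
The update rule itself is standard. As a rough illustration, a minimal sketch of GRPO's group-relative advantage computation; the KL penalty, clipping term, and the paper's exact hyperparameters are omitted, and the function name is ours:

    import torch

    def grpo_group_advantages(rewards: torch.Tensor) -> torch.Tensor:
        # Standardize each response's reward against the other responses
        # sampled for the same input: above-mean responses get positive
        # advantage and are reinforced, below-mean ones are suppressed.
        mean = rewards.mean(dim=-1, keepdim=True)
        std = rewards.std(dim=-1, keepdim=True)
        return (rewards - mean) / (std + 1e-6)

    # Hypothetical group of G=4 responses for one image, each scored by a
    # verifiable visual reward (e.g., IoU for detection).
    rewards = torch.tensor([[0.90, 0.20, 0.55, 0.00]])
    advantages = grpo_group_advantages(rewards)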

What carries the argument

Visual perception verifiable reward functions, such as Intersection over Union for object detection, paired with Group Relative Policy Optimization to update the policy from multiple generated responses.
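
As an illustration of what "verifiable" means here, a minimal IoU reward over axis-aligned (x1, y1, x2, y2) boxes; the box format and the absence of any extra reward shaping are assumptions, not the paper's exact formulation:

    def iou_reward(pred_box, gt_box):
        # Intersection over Union in [0, 1], usable directly as a
        # verifiable reward: computed from the prediction and ground
        # truth alone, with no learned judge.
        ix1 = max(pred_box[0], gt_box[0])
        iy1 = max(pred_box[1], gt_box[1])
        ix2 = min(pred_box[2], gt_box[2])
        iy2 = min(pred_box[3], gt_box[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_pred = (pred_box[2] - pred_box[0]) * (pred_box[3] - pred_box[1])
        area_gt = (gt_box[2] - gt_box[0]) * (gt_box[3] - gt_box[1])
        union = area_pred + area_gt - inter
        return inter / union if union > 0 else 0.0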

If this is right

  • Delivers a 24.3 percent accuracy increase over baseline in one-shot fine-grained image classification using around 100 samples.
  • Exceeds supervised fine-tuning by 21.9 points on COCO two-shot object detection and by 15.4 points on LVIS.
  • Improves results on reasoning grounding and open-vocabulary object detection relative to supervised baselines.
  • Offers a data-efficient alternative to supervised fine-tuning for domain-specific adaptation of vision-language models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same reward-driven approach could transfer to additional visual tasks such as segmentation if equivalent quantifiable metrics are available.
  • Future work could combine visual rewards with language-based rewards to strengthen cross-modal reasoning chains.
  • Gains may compound if base large vision-language models improve at generating diverse initial responses before optimization begins.

Load-bearing premise

That visual perception reward functions like IoU supply sufficiently dense and unbiased signals to guide effective policy optimization on visual tasks.

What would settle it

Apply Visual-RFT to a new visual task lacking a clear quantitative reward metric, such as subjective image quality assessment, and check whether it still beats supervised fine-tuning on the same limited data; if the advantage disappears, the premise that dense quantitative rewards drive the gains would be confirmed.

read the original abstract

Reinforcement Fine-Tuning (RFT) in Large Reasoning Models like OpenAI o1 learns from feedback on its answers, which is especially useful in applications when fine-tuning data is scarce. Recent open-source work like DeepSeek-R1 demonstrates that reinforcement learning with verifiable reward is one key direction in reproducing o1. While the R1-style model has demonstrated success in language models, its application in multi-modal domains remains under-explored. This work introduces Visual Reinforcement Fine-Tuning (Visual-RFT), which further extends the application areas of RFT on visual tasks. Specifically, Visual-RFT first uses Large Vision-Language Models (LVLMs) to generate multiple responses containing reasoning tokens and final answers for each input, and then uses our proposed visual perception verifiable reward functions to update the model via the policy optimization algorithm such as Group Relative Policy Optimization (GRPO). We design different verifiable reward functions for different perception tasks, such as the Intersection over Union (IoU) reward for object detection. Experimental results on fine-grained image classification, few-shot object detection, reasoning grounding, as well as open-vocabulary object detection benchmarks show the competitive performance and advanced generalization ability of Visual-RFT compared with Supervised Fine-tuning (SFT). For example, Visual-RFT improves accuracy by $24.3\%$ over the baseline in one-shot fine-grained image classification with around 100 samples. In few-shot object detection, Visual-RFT also exceeds the baseline by $21.9$ on COCO's two-shot setting and $15.4$ on LVIS. Our Visual-RFT represents a paradigm shift in fine-tuning LVLMs, offering a data-efficient, reward-driven approach that enhances reasoning and adaptability for domain-specific tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Visual-RFT, extending reinforcement fine-tuning (RFT) with verifiable rewards to visual perception tasks in LVLMs. It generates multiple reasoning+answer responses per input via the base model, applies task-specific visual rewards (exemplified by IoU for detection), and optimizes the policy with GRPO. Experiments claim large gains over SFT on few-shot fine-grained classification (+24.3% with ~100 samples), COCO 2-shot detection (+21.9), LVIS, reasoning grounding, and open-vocabulary detection, positioning the method as a data-efficient, reward-driven alternative to supervised fine-tuning.

Significance. If the reported gains are shown to arise specifically from dense visual-perception rewards rather than from multi-sample generation or GRPO regularization alone, the work would demonstrate a practical route to reward-driven adaptation of LVLMs in data-scarce regimes. The multi-task coverage and direct SFT comparisons are positive; however, the absence of reward-function pseudocode, training-budget controls, and significance tests limits the strength of the central claim that visual rewards are the key differentiator.

major comments (3)
  1. [Abstract / §3] Abstract and §3 (reward design): the manuscript exemplifies only the IoU reward for detection; the exact functional form of the 'visual perception verifiable reward' for fine-grained classification is never stated. If this reward reduces to exact string match on the final answer token (or LVLM-judged correctness), it supplies no additional visual signal beyond the cross-entropy loss already used in the SFT baseline on the same ~100 samples, undermining the premise that visual rewards drive the 24.3% lift.
  2. [§4] §4 (experiments): no training-budget table, no wall-clock or token counts for the GRPO runs versus the SFT baselines, and no statistical significance tests (e.g., standard error over multiple seeds) are reported for the headline numbers (+24.3% classification, +21.9 COCO 2-shot). Without these controls it is impossible to rule out that the observed differences arise from longer effective optimization or variance rather than the visual reward. (A sketch of the requested seed-level statistic appears after the minor comments.)
  3. [§3.2] §3.2 (GRPO formulation): the paper adopts the standard GRPO objective without modification. The manuscript must isolate whether any performance increment survives when the visual reward is replaced by a non-visual answer-match reward; otherwise the central claim that 'visual perception verifiable reward functions' are the operative ingredient remains untested.
minor comments (2)
  1. [§3] Notation for the reward functions is introduced only by example; a single compact equation or pseudocode block listing r_class, r_det, r_grounding would improve reproducibility.
  2. [§4] Figure captions and axis labels in the few-shot detection plots omit the exact number of training samples per class; this information is only recoverable from the text.
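
On major comment 2: the requested statistic is just a mean with standard error over independent training seeds. A minimal sketch; the per-seed accuracies are invented for illustration:

    import statistics

    def mean_and_stderr(scores):
        # Mean and standard error over independent training seeds.
        m = statistics.mean(scores)
        se = statistics.stdev(scores) / len(scores) ** 0.5
        return m, se

    # Hypothetical per-seed accuracies for one benchmark run five times.
    acc = [81.2, 79.8, 80.5, 81.0, 80.1]
    m, se = mean_and_stderr(acc)
    print(f"{m:.1f} ± {se:.1f}")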

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below and will revise the paper to strengthen the presentation and claims.

read point-by-point responses
  1. Referee: [Abstract / §3] Abstract and §3 (reward design): the manuscript exemplifies only the IoU reward for detection; the exact functional form of the 'visual perception verifiable reward' for fine-grained classification is never stated. If this reward reduces to exact string match on the final answer token (or LVLM-judged correctness), it supplies no additional visual signal beyond the cross-entropy loss already used in the SFT baseline on the same ~100 samples, undermining the premise that visual rewards drive the 24.3% lift.

    Authors: We thank the referee for identifying this omission. The reward for fine-grained classification is a binary verifiable function: reward = 1 if the final answer string exactly matches the ground-truth class label, and 0 otherwise. This is directly computable from the output without external judges. While the reward evaluates answer correctness, the Visual-RFT pipeline differs from SFT by sampling multiple reasoning+answer trajectories per image and optimizing via GRPO, which reinforces visual reasoning paths that lead to correct classifications. We will add the explicit mathematical definition and pseudocode for the classification reward (alongside the IoU formulation) in the revised Section 3 (a sketch of this reward appears after these responses). revision: yes

  2. Referee: [§4] §4 (experiments): no training-budget table, no wall-clock or token counts for the GRPO runs versus the SFT baselines, and no statistical significance tests (e.g., standard error over multiple seeds) are reported for the headline numbers (+24.3% classification, +21.9 COCO 2-shot). Without these controls it is impossible to rule out that the observed differences arise from longer effective optimization or variance rather than the visual reward.

    Authors: We agree that these controls are required for a rigorous comparison. In the revised manuscript we will insert a new table in Section 4 that reports training budgets (total tokens, wall-clock time, and optimization steps) for Visual-RFT versus SFT on every benchmark. We will also rerun the primary experiments across multiple random seeds and report means with standard errors to quantify statistical significance of the gains. revision: yes

  3. Referee: [§3.2] §3.2 (GRPO formulation): the paper adopts the standard GRPO objective without modification. The manuscript must isolate whether any performance increment survives when the visual reward is replaced by a non-visual answer-match reward; otherwise the central claim that 'visual perception verifiable reward functions' are the operative ingredient remains untested.

    Authors: We accept the need for this isolation experiment. In the revision we will add an ablation that replaces the task-specific visual rewards (IoU for detection/grounding, exact-match for classification) with a non-visual answer-match reward that only checks final-answer correctness. Performance differences will be reported to test whether the visual component of the reward is responsible for the observed gains. For spatial tasks the non-visual reward necessarily omits dense localization signals, but we will present the comparison explicitly. revision: yes
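
Pulling responses 1 and 3 together, a minimal sketch of the two reward variants the proposed ablation would compare. The names and data layout are illustrative, not the authors' code, and iou_reward refers to the sketch given earlier on this page:

    def classification_reward(answer: str, label: str) -> float:
        # Response 1's reward: 1.0 iff the final answer string exactly
        # matches the ground-truth class label (binary, verifiable).
        return 1.0 if answer == label else 0.0

    def visual_reward(task: str, response: dict, truth: dict) -> float:
        # Task-specific visual reward used by Visual-RFT: dense IoU for
        # spatial tasks, exact match for classification.
        if task in ("detection", "grounding"):
            return iou_reward(response["box"], truth["box"])
        return classification_reward(response["answer"], truth["label"])

    def answer_match_reward(response: dict, truth: dict) -> float:
        # Non-visual control for the ablation: final-answer correctness
        # only, discarding all localization signal.
        return 1.0 if response["answer"] == truth["answer"] else 0.0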

Circularity Check

0 steps flagged

No significant circularity in Visual-RFT derivation chain

full rationale

The paper's core method applies standard GRPO (from external prior work) to multiple LVLM-generated responses, using externally defined verifiable rewards such as IoU for detection and analogous task-specific functions for classification/grounding. These rewards are constructed from standard metrics independent of the model's fitted parameters or the target performance numbers. Reported gains (e.g., +24.3% on one-shot classification) are empirical benchmark results, not mathematical predictions or derivations that reduce to the inputs by construction. No self-definitional steps, fitted-input-as-prediction, or load-bearing self-citation chains appear in the described procedure. The derivation remains self-contained against external benchmarks and standard RL components.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that task-specific verifiable rewards can be defined without circularity and that GRPO updates remain stable when rewards are sparse or noisy; no free parameters are explicitly fitted in the abstract description.

axioms (1)
  • domain assumption Verifiable reward functions such as IoU can be computed reliably from model outputs and ground truth without additional learned components.
    Invoked when defining rewards for object detection and grounding tasks.

pith-pipeline@v0.9.0 · 5637 in / 1194 out tokens · 61460 ms · 2026-05-13T22:11:14.863124+00:00 · methodology


Forward citations

Cited by 25 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing

    cs.CV 2026-05 unverdicted novelty 7.0

    A planner-orchestrator system learns long-horizon image editing by maximizing outcome-based rewards from a vision-language judge and refining plans from successful trajectories.

  2. Pest-Thinker: Learning to Think and Reason like Entomologists via Reinforcement Learning

    cs.CV 2026-05 unverdicted novelty 7.0

    Pest-Thinker is a reinforcement learning framework that improves MLLMs' expert-level reasoning on pest morphology via synthesized CoT trajectories, GRPO optimization, and an LLM-judged feature reward on new benchmarks...

  3. CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding

    cs.CV 2026-04 unverdicted novelty 7.0

    CGC improves fine-grained multi-image understanding in MLLMs by constructing contrastive training instances from existing single-image annotations and adding a rule-based spatial reward, achieving SOTA on MIG-Bench an...

  4. Multi-modal Reasoning with LLMs for Visual Semantic Arithmetic

    cs.AI 2026-04 unverdicted novelty 7.0

    SAri-RFT applies GRPO-based reinforcement fine-tuning to LVLMs on novel two-term and three-term visual semantic arithmetic tasks, reaching SOTA on the new IRPD dataset and Visual7W-Telling.

  5. UIPress: Bringing Optical Token Compression to UI-to-Code Generation

    cs.CL 2026-04 unverdicted novelty 7.0

    UIPress is the first encoder-side learned optical compression method for UI-to-Code that compresses visual tokens to 256, outperforming the uncompressed baseline by 7.5% CLIP score and the best inference-time baseline...

  6. Understanding the Role of Hallucination in Reinforcement Post-Training of Multimodal Reasoning Models

    cs.LG 2026-04 unverdicted novelty 7.0

    RL post-training on hallucination-forced multimodal data improves reasoning performance and can outperform standard training.

  7. GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents

    cs.CV 2025-04 unverdicted novelty 7.0

    GUI-R1 uses reinforcement fine-tuning with GRPO on a small curated dataset to create a generalist vision-language action model that outperforms prior GUI agent methods across mobile, desktop, and web benchmarks using ...

  8. ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.

  9. RadThinking: A Dataset for Longitudinal Clinical Reasoning in Radiology

    cs.CV 2026-05 unverdicted novelty 6.0

    RadThinking releases a large longitudinal CT VQA dataset stratified into foundation perception questions, single-rule reasoning questions, and compositional multi-step chains grounded in clinical reporting standards f...

  10. Injecting Distributional Awareness into MLLMs via Reinforcement Learning for Deep Imbalanced Regression

    cs.CL 2026-05 unverdicted novelty 6.0

    A Group Relative Policy Optimization framework with concordance correlation coefficient rewards improves MLLM regression accuracy on long-tailed distributions, especially in medium- and few-shot regimes, without model...

  11. Injecting Distributional Awareness into MLLMs via Reinforcement Learning for Deep Imbalanced Regression

    cs.CL 2026-05 unverdicted novelty 6.0

    A plug-and-play RL method adds batch-level distributional supervision via CCC rewards to reduce regression-to-the-mean in MLLMs on imbalanced regression benchmarks.

  12. Rethinking Reinforcement Fine-Tuning in LVLM: Convergence, Reward Decomposition, and Generalization

    cs.LG 2026-04 unverdicted novelty 6.0

    Introduces TA-MDP and proves GRPO convergence at O(1/sqrt(T)), a reward decomposition bound, and PAC-Bayes generalization for tool-augmented LVLM policies.

  13. CLEAR: Unlocking Generative Potential for Degraded Image Understanding in Unified Multimodal Models

    cs.CV 2026-04 unverdicted novelty 6.0

    CLEAR uses degradation-aware fine-tuning, a latent representation bridge, and interleaved reinforcement learning to connect generative and reasoning capabilities in multimodal models for better degraded image understanding.

  14. Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning

    cs.CV 2026-04 unverdicted novelty 6.0

    RLER trains video-reasoning models with three task-driven RL rewards for evidence production and elects the best answer from a few candidates via evidence consistency scoring, yielding 6.3% average gains on eight benchmarks.

  15. EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs

    cs.CV 2026-04 unverdicted novelty 6.0

    EgoMind activates spatial cognition in MLLMs via linguistic Role-Play Caption and Progressive Spatial Analysis, reaching competitive results on VSI-Bench, SPAR-Bench, SITE-Bench and SPBench with only 5K SFT and 20K RL...

  16. VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

    cs.CV 2025-04 unverdicted novelty 6.0

    VLM-R1 applies R1-style RL using rule-based rewards on visual tasks with clear ground truth to achieve competitive performance and superior generalization over SFT in vision-language models.

  17. Perceptual Flow Network for Visually Grounded Reasoning

    cs.CV 2026-05 unverdicted novelty 5.0

    PFlowNet decouples perception from reasoning, integrates multi-dimensional rewards with vicinal geometric shaping via variational RL, and reports new SOTA results on V* Bench (90.6%) and MME-RealWorld-lite (67.0%).

  18. Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding

    cs.LG 2026-04 unverdicted novelty 5.0

    A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.

  19. HalluClear: Diagnosing, Evaluating and Mitigating Hallucinations in GUI Agents

    cs.AI 2026-04 unverdicted novelty 5.0

    HalluClear supplies a taxonomy, calibrated evaluation, and lightweight post-training mitigation that reduces hallucinations in GUI agents using only 9K samples.

  20. Towards Robust Endogenous Reasoning: Unifying Drift Adaptation in Non-Stationary Tuning

    cs.LG 2026-04 unverdicted novelty 5.0

    CPO++ adapts reinforcement fine-tuning of MLLMs to endogenous multi-modal concept drift through counterfactual reasoning and preference optimization, yielding better coherence and cross-domain robustness in safety-cri...

  21. SignReasoner: Compositional Reasoning for Complex Traffic Sign Understanding via Functional Structure Units

    cs.CV 2026-04 unverdicted novelty 5.0

    SignReasoner decomposes traffic signs into functional structure units and uses a two-stage VLM post-training pipeline to achieve state-of-the-art compositional reasoning on a new benchmark.

  22. SVSR: A Self-Verification and Self-Rectification Paradigm for Multimodal Reasoning

    cs.AI 2026-04 unverdicted novelty 5.0

    SVSR trains multimodal models to verify and correct their own reasoning using a preference dataset, supervised fine-tuning, and semi-online DPO with a teacher model.

  23. Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models

    cs.CV 2026-04 unverdicted novelty 5.0

    HDPO reframes tool efficiency as a conditional objective within accurate trajectories, enabling Metis to reduce tool invocations by orders of magnitude while raising reasoning accuracy.

  24. RCoT-Seg: Reinforced Chain-of-Thought for Video Reasoning and Segmentation

    cs.CV 2026-05 unverdicted novelty 4.0

    RCoT-Seg uses GRPO-reinforced keyframe selection from a CoT-start corpus followed by SAM2 mask propagation to improve video object segmentation under implicit temporal instructions over prior MLLM sampling methods.

  25. From System 1 to System 2: A Survey of Reasoning Large Language Models

    cs.AI 2025-02 accept novelty 3.0

    The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · cited by 24 Pith papers · 13 internal anchors

  1. [1]

    Lmrl gym: Benchmarks for multi-turn reinforcement learning with language models, 2023

    Marwa Abdulhai, Isadora White, Charlie Snell, Charles Sun, Joey Hong, Yuexiang Zhai, Kelvin Xu, and Sergey Levine. Lmrl gym: Benchmarks for multi-turn reinforcement learning with language models. arXiv preprint arXiv:2311.18232, 2023.

  2. [2]

    Internlm2 technical report

    Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, et al. Internlm2 technical report. arXiv preprint arXiv:2403.17297, 2024.

  3. [3]

    Grounding large language models in interactive environments with online reinforcement learning

    Thomas Carta, Clément Romac, Thomas Wolf, Sylvain Lamprier, Olivier Sigaud, and Pierre-Yves Oudeyer. Grounding large language models in interactive environments with online reinforcement learning. In ICLR, 2023.

  4. [4]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.

  5. [5]

    Lvis: A dataset for large vocabulary instance segmentation

    Agrim Gupta, Piotr Dollar, and Ross Girshick. Lvis: A dataset for large vocabulary instance segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5356–5364, 2019.

  6. [6]

    Qwen2.5-Coder Technical Report

    Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al. Qwen2.5-Coder technical report. arXiv preprint arXiv:2409.12186, 2024.

  7. [7]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. arXiv:2412.16720, 2024.

  8. [8]

    Preference optimization for reasoning with pseudo feedback

    Fangkai Jiao, Geyang Guo, Xingxing Zhang, Nancy F Chen, Shafiq Joty, and Furu Wei. Preference optimization for reasoning with pseudo feedback. arXiv preprint arXiv:2411.16345, 2024.

  9. [9]

    Segment anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4015–4026, 2023.

  10. [10]

    3d object representations for fine-grained categorization

    Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In ICCV workshops, 2013.

  11. [11]

    Lisa: Reasoning segmentation via large language model

    Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9579–9589, 2024.

  12. [12]

    Tulu 3: Pushing Frontiers in Open Language Model Post-Training

    Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, et al. Tülu 3: Pushing frontiers in open language model post-training. arXiv preprint arXiv:2411.15124, 2024.

  13. [13]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024.

  14. [14]

    Open-vocabulary semantic segmentation with mask-adapted clip

    Feng Liang, Bichen Wu, Xiaoliang Dai, Kunpeng Li, Yinan Zhao, Hang Zhang, Peizhao Zhang, Peter Vajda, and Diana Marculescu. Open-vocabulary semantic segmentation with mask-adapted clip. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7061–7070, 2023.

  15. [15]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.

  16. [16]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024.

  17. [17]

    Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs

    Chris Yuhao Liu, Liang Zeng, Jiacai Liu, Rui Yan, Jujie He, Chaojie Wang, Shuicheng Yan, Yang Liu, and Yahui Zhou. Skywork-Reward: Bag of tricks for reward modeling in llms. arXiv preprint arXiv:2410.18451, 2024.

  18. [18]

    Grounding dino: Marrying dino with grounded pre-training for open-set object detection

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In European Conference on Computer Vision, pages 38–55. Springer, 2024.

  19. [19]

    Mia-dpo: Multi-image augmented direct preference optimization for large vision-language models

    Ziyu Liu, Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Haodong Duan, Conghui He, Yuanjun Xiong, Dahua Lin, and Jiaqi Wang. Mia-dpo: Multi-image augmented direct preference optimization for large vision-language models. arXiv preprint arXiv:2410.17637, 2024.

  20. [20]

    Reft: Reasoning with reinforced fine-tuning, 2024

    Trung Quoc Luong, Xinbo Zhang, Zhanming Jie, Peng Sun, Xiaoran Jin, and Hang Li. Reft: Reasoning with reinforced fine-tuning, 2024.

  21. [21]

    Fine-Grained Visual Classification of Aircraft

    Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013.

  22. [22]

    Automated flower classification over a large number of classes

    Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In ICVGIP, 2008.

  23. [23]

    Hello gpt-4o, 2024

    OpenAI. Hello gpt-4o, 2024.

  24. [24]

    Openai o3-mini system card, 2025

    OpenAI. Openai o3-mini system card, 2025.

  25. [25]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. In NeurIPS, 2022.

  26. [26]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke E. Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Francis Christiano, Jan Leike, and Ryan J. Lowe. Training language models to follow instructions ...

  27. [27]

    Cats and dogs

    Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. Cats and dogs. In CVPR, 2012.

  28. [28]

    Is reinforcement learning (not) for natural language processing: Benchmarks, baselines, and building blocks for natural language policy optimization

    Rajkumar Ramamurthy, Prithviraj Ammanabrolu, Kianté Brantley, Jack Hessel, Rafet Sifa, Christian Bauckhage, Hannaneh Hajishirzi, and Yejin Choi. Is reinforcement learning (not) for natural language processing: Benchmarks, baselines, and building blocks for natural language policy optimization. In ICLR, 2023.

  29. [29]

    Glamm: Pixel grounding large multimodal model

    Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M Anwer, Eric Xing, Ming-Hsuan Yang, and Fahad S Khan. Glamm: Pixel grounding large multimodal model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13009–13018, 2024.

  30. [30]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv:1707.06347, 2017.

  31. [31]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

  32. [32]

    Offline RL for natural language generation with implicit language q learning

    Charlie Victor Snell, Ilya Kostrikov, Yi Su, Sherry Yang, and Sergey Levine. Offline RL for natural language generation with implicit language q learning. In ICLR, 2023.

  33. [33]

    Learning to summarize with human feedback

    Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. Learning to summarize with human feedback. In NeurIPS, 2022.

  34. [35]

    Aligning large multimodal models with factually augmented rlhf

    Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liang-Yan Gui, Yu-Xiong Wang, Yiming Yang, et al. Aligning large multimodal models with factually augmented rlhf. arXiv preprint arXiv:2309.14525, 2023.

  35. [36]

    Aligning large multimodal models with factually augmented rlhf

    Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liang-Yan Gui, Yu-Xiong Wang, Yiming Yang, et al. Aligning large multimodal models with factually augmented rlhf. In ACL, 2024.

  36. [37]

    Kimi k1.5: Scaling Reinforcement Learning with LLMs

    Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1.5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599, 2025.

  37. [38]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.

  38. [39]

    Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement

    An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, et al. Qwen2.5-Math technical report: Toward mathematical expert model via self-improvement. arXiv preprint arXiv:2409.12122, 2024.

  39. [40]

    React: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In ICLR, 2023.

  40. [41]

    Internlm-math: Open math large language models toward verifiable reasoning, 2024

    Huaiyuan Ying, Shuo Zhang, Linyang Li, Zhejian Zhou, Yunfan Shao, Zhaoye Fei, Yichuan Ma, Jiawei Hong, Kuikun Liu, Ziyi Wang, et al. Internlm-math: Open math large language models toward verifiable reasoning. arXiv preprint arXiv:2402.06332, 2024.

  41. [42]

    RLHF-V: Towards trustworthy mllms via behavior alignment from fine-grained correctional human feedback

    Tianyu Yu, Yuan Yao, Haoye Zhang, Taiwen He, Yifeng Han, Ganqu Cui, Jinyi Hu, Zhiyuan Liu, Hai-Tao Zheng, Maosong Sun, et al. RLHF-V: Towards trustworthy mllms via behavior alignment from fine-grained correctional human feedback. In CVPR, 2024.

  42. [43]

    RLAIF-V: Aligning mllms through open-source ai feedback for super gpt-4v trustworthiness

    Tianyu Yu, Haoye Zhang, Yuan Yao, Yunkai Dang, Da Chen, Xiaoman Lu, Ganqu Cui, Taiwen He, Zhiyuan Liu, Tat-Seng Chua, et al. RLAIF-V: Aligning mllms through open-source ai feedback for super gpt-4v trustworthiness. arXiv preprint arXiv:2405.17220, 2024.

  43. [44]

    Contextual object detection with multimodal large language models

    Yuhang Zang, Wei Li, Jun Han, Kaiyang Zhou, and Chen Change Loy. Contextual object detection with multimodal large language models. IJCV, 2024.

  44. [45]

    Internlm-xcomposer2.5-reward: A simple yet effective multi-modal reward model

    Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Ziyu Liu, Shengyuan Ding, Shenxi Wu, Yubo Ma, Haodong Duan, Wenwei Zhang, et al. Internlm-xcomposer2.5-reward: A simple yet effective multi-modal reward model. arXiv preprint arXiv:2501.12368, 2025.

  45. [46]

    Codedpo: Aligning code models with self generated and verified source code

    Kechi Zhang, Ge Li, Yihong Dong, Jingjing Xu, Jun Zhang, Jing Su, Yongfei Liu, and Zhi Jin. Codedpo: Aligning code models with self generated and verified source code. arXiv preprint arXiv:2410.05605, 2024.

  46. [47]

    Internlm-xcomposer-2.5: A versatile large vision language model supporting long-contextual input and output

    Pan Zhang, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Rui Qian, Lin Chen, Qipeng Guo, Haodong Duan, Bin Wang, Linke Ouyang, et al. Internlm-xcomposer-2.5: A versatile large vision language model supporting long-contextual input and output. arXiv preprint arXiv:2407.03320, 2024.

  47. [48]

    o1-coder: an o1 replication for coding

    Yuxiang Zhang, Shangxi Wu, Yuqi Yang, Jiangming Shu, Jinlin Xiao, Chao Kong, and Jitao Sang. o1-coder: an o1 replication for coding. arXiv preprint arXiv:2412.00154, 2024.

  48. [49]

    Beyond hallucinations: Enhancing lvlms through hallucination-aware direct preference optimization

    Zhiyuan Zhao, Bin Wang, Linke Ouyang, Xiaoyi Dong, Jiaqi Wang, and Conghui He. Beyond hallucinations: Enhancing lvlms through hallucination-aware direct preference optimization. arXiv preprint arXiv:2311.16839, 2023.

  49. [51]

    Aligning modalities in vision large language models via preference fine-tuning

    Yiyang Zhou, Chenhang Cui, Rafael Rafailov, Chelsea Finn, and Huaxiu Yao. Aligning modalities in vision large language models via preference fine-tuning. arXiv preprint arXiv:2402.11411, 2024.

  50. [52]

    Archer: Training language model agents via hierarchical multi-turn rl

    Yifei Zhou, Andrea Zanette, Jiayi Pan, Sergey Levine, and Aviral Kumar. Archer: Training language model agents via hierarchical multi-turn rl. In ICML, 2024.

  51. [53]

    Fine-Tuning Language Models from Human Preferences

    Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. arXiv:1909.08593, 2019.

  52. [54]

    Generalized decoding for pixel, image, and language

    Xueyan Zou, Zi-Yi Dou, Jianwei Yang, Zhe Gan, Linjie Li, Chunyuan Li, Xiyang Dai, Harkirat Behl, Jianfeng Wang, Lu Yuan, et al. Generalized decoding for pixel, image, and language. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15116–15127, 2023.