pith. machine review for the scientific record.

arxiv: 2503.01785 · v1 · submitted 2025-03-03 · 💻 cs.CV

Recognition: unknown

Visual-RFT: Visual Reinforcement Fine-Tuning

Authors on Pith · no claims yet

Pith reviewed 2026-05-13 22:11 UTC · model grok-4.3

classification 💻 cs.CV
keywords visual reinforcement fine-tuning · large vision-language models · few-shot learning · object detection · image classification · policy optimization · reinforcement learning

The pith

Visual-RFT lets large vision-language models learn visual tasks from verifiable perceptual rewards instead of large labeled datasets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Visual-RFT to extend reinforcement fine-tuning from language models into visual domains for large vision-language models. It generates multiple reasoned responses per input image and refines the model using reward functions based on visual metrics such as Intersection over Union for detection. This yields substantial gains over supervised fine-tuning in low-data settings, including a 24.3 percent accuracy lift in one-shot fine-grained classification with roughly 100 samples and a 21.9-point gain on COCO's two-shot object detection setting (15.4 on LVIS). A sympathetic reader would care because the method reduces reliance on large annotated datasets for adapting multimodal models to specific visual tasks.

Core claim

Visual-RFT first uses Large Vision-Language Models to generate multiple responses containing reasoning tokens and final answers for each input, and then uses the proposed visual perception verifiable reward functions to update the model via a policy optimization algorithm such as Group Relative Policy Optimization (GRPO). Different verifiable reward functions are designed for different perception tasks, such as the Intersection over Union reward for object detection, producing competitive performance and advanced generalization on fine-grained image classification, few-shot object detection, reasoning grounding, and open-vocabulary object detection benchmarks.
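
The update rule itself is standard. As a rough illustration, a minimal sketch of GRPO's group-relative advantage computation; the KL penalty, clipping term, and the paper's exact hyperparameters are omitted, and the function name is ours:

    import torch

    def grpo_group_advantages(rewards: torch.Tensor) -> torch.Tensor:
        # Standardize each response's reward against the other responses
        # sampled for the same input: above-mean responses get positive
        # advantage and are reinforced, below-mean ones are suppressed.
        mean = rewards.mean(dim=-1, keepdim=True)
        std = rewards.std(dim=-1, keepdim=True)
        return (rewards - mean) / (std + 1e-6)

    # Hypothetical group of G=4 responses for one image, each scored by a
    # verifiable visual reward (e.g., IoU for detection).
    rewards = torch.tensor([[0.90, 0.20, 0.55, 0.00]])
    advantages = grpo_group_advantages(rewards)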

What carries the argument

Visual perception verifiable reward functions, such as Intersection over Union for object detection, paired with Group Relative Policy Optimization to update the policy from multiple generated responses.
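
As an illustration of what "verifiable" means here, a minimal IoU reward over axis-aligned (x1, y1, x2, y2) boxes; the box format and the absence of any extra reward shaping are assumptions, not the paper's exact formulation:

    def iou_reward(pred_box, gt_box):
        # Intersection over Union in [0, 1], usable directly as a
        # verifiable reward: computed from the prediction and ground
        # truth alone, with no learned judge.
        ix1 = max(pred_box[0], gt_box[0])
        iy1 = max(pred_box[1], gt_box[1])
        ix2 = min(pred_box[2], gt_box[2])
        iy2 = min(pred_box[3], gt_box[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_pred = (pred_box[2] - pred_box[0]) * (pred_box[3] - pred_box[1])
        area_gt = (gt_box[2] - gt_box[0]) * (gt_box[3] - gt_box[1])
        union = area_pred + area_gt - inter
        return inter / union if union > 0 else 0.0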

If this is right

  • Delivers a 24.3 percent accuracy increase over baseline in one-shot fine-grained image classification using around 100 samples.
  • Exceeds supervised fine-tuning by 21.9 points on COCO two-shot object detection and by 15.4 points on LVIS.
  • Improves results on reasoning grounding and open-vocabulary object detection relative to supervised baselines.
  • Offers a data-efficient alternative to supervised fine-tuning for domain-specific adaptation of vision-language models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same reward-driven approach could transfer to additional visual tasks such as segmentation if equivalent quantifiable metrics are available.
  • Future work could combine visual rewards with language-based rewards to strengthen cross-modal reasoning chains.
  • Gains may compound if base large vision-language models improve at generating diverse initial responses before optimization begins.

Load-bearing premise

That visual perception reward functions like IoU supply sufficiently dense and unbiased signals to guide effective policy optimization on visual tasks.

What would settle it

Apply Visual-RFT to a new visual task lacking a clear quantitative reward metric, such as subjective image quality assessment, and check whether it still beats supervised fine-tuning on the same limited data; if the advantage disappears, the premise that dense quantitative rewards drive the gains would be confirmed.

read the original abstract

Reinforcement Fine-Tuning (RFT) in Large Reasoning Models like OpenAI o1 learns from feedback on its answers, which is especially useful in applications when fine-tuning data is scarce. Recent open-source work like DeepSeek-R1 demonstrates that reinforcement learning with verifiable reward is one key direction in reproducing o1. While the R1-style model has demonstrated success in language models, its application in multi-modal domains remains under-explored. This work introduces Visual Reinforcement Fine-Tuning (Visual-RFT), which further extends the application areas of RFT on visual tasks. Specifically, Visual-RFT first uses Large Vision-Language Models (LVLMs) to generate multiple responses containing reasoning tokens and final answers for each input, and then uses our proposed visual perception verifiable reward functions to update the model via the policy optimization algorithm such as Group Relative Policy Optimization (GRPO). We design different verifiable reward functions for different perception tasks, such as the Intersection over Union (IoU) reward for object detection. Experimental results on fine-grained image classification, few-shot object detection, reasoning grounding, as well as open-vocabulary object detection benchmarks show the competitive performance and advanced generalization ability of Visual-RFT compared with Supervised Fine-tuning (SFT). For example, Visual-RFT improves accuracy by $24.3\%$ over the baseline in one-shot fine-grained image classification with around 100 samples. In few-shot object detection, Visual-RFT also exceeds the baseline by $21.9$ on COCO's two-shot setting and $15.4$ on LVIS. Our Visual-RFT represents a paradigm shift in fine-tuning LVLMs, offering a data-efficient, reward-driven approach that enhances reasoning and adaptability for domain-specific tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Visual-RFT, extending reinforcement fine-tuning (RFT) with verifiable rewards to visual perception tasks in LVLMs. It generates multiple reasoning+answer responses per input via the base model, applies task-specific visual rewards (exemplified by IoU for detection), and optimizes the policy with GRPO. Experiments claim large gains over SFT on few-shot fine-grained classification (+24.3% with ~100 samples), COCO 2-shot detection (+21.9), LVIS, reasoning grounding, and open-vocabulary detection, positioning the method as a data-efficient, reward-driven alternative to supervised fine-tuning.

Significance. If the reported gains are shown to arise specifically from dense visual-perception rewards rather than from multi-sample generation or GRPO regularization alone, the work would demonstrate a practical route to reward-driven adaptation of LVLMs in data-scarce regimes. The multi-task coverage and direct SFT comparisons are positive; however, the absence of reward-function pseudocode, training-budget controls, and significance tests limits the strength of the central claim that visual rewards are the key differentiator.

major comments (3)
  1. [Abstract / §3] Abstract and §3 (reward design): the manuscript exemplifies only the IoU reward for detection; the exact functional form of the 'visual perception verifiable reward' for fine-grained classification is never stated. If this reward reduces to exact string match on the final answer token (or LVLM-judged correctness), it supplies no additional visual signal beyond the cross-entropy loss already used in the SFT baseline on the same ~100 samples, undermining the premise that visual rewards drive the 24.3% lift.
  2. [§4] §4 (experiments): no training-budget table, no wall-clock or token counts for the GRPO runs versus the SFT baselines, and no statistical significance tests (e.g., standard error over multiple seeds) are reported for the headline numbers (+24.3% classification, +21.9 COCO 2-shot). Without these controls it is impossible to rule out that the observed differences arise from longer effective optimization or variance rather than the visual reward. (A sketch of the requested seed-level statistic appears after the minor comments.)
  3. [§3.2] §3.2 (GRPO formulation): the paper adopts the standard GRPO objective without modification. The manuscript must isolate whether any performance increment survives when the visual reward is replaced by a non-visual answer-match reward; otherwise the central claim that 'visual perception verifiable reward functions' are the operative ingredient remains untested.
minor comments (2)
  1. [§3] Notation for the reward functions is introduced only by example; a single compact equation or pseudocode block listing r_class, r_det, r_grounding would improve reproducibility.
  2. [§4] Figure captions and axis labels in the few-shot detection plots omit the exact number of training samples per class; this information is only recoverable from the text.
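
On major comment 2: the requested statistic is just a mean with standard error over independent training seeds. A minimal sketch; the per-seed accuracies are invented for illustration:

    import statistics

    def mean_and_stderr(scores):
        # Mean and standard error over independent training seeds.
        m = statistics.mean(scores)
        se = statistics.stdev(scores) / len(scores) ** 0.5
        return m, se

    # Hypothetical per-seed accuracies for one benchmark run five times.
    acc = [81.2, 79.8, 80.5, 81.0, 80.1]
    m, se = mean_and_stderr(acc)
    print(f"{m:.1f} ± {se:.1f}")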

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below and will revise the paper to strengthen the presentation and claims.

read point-by-point responses
  1. Referee: [Abstract / §3] Abstract and §3 (reward design): the manuscript exemplifies only the IoU reward for detection; the exact functional form of the 'visual perception verifiable reward' for fine-grained classification is never stated. If this reward reduces to exact string match on the final answer token (or LVLM-judged correctness), it supplies no additional visual signal beyond the cross-entropy loss already used in the SFT baseline on the same ~100 samples, undermining the premise that visual rewards drive the 24.3% lift.

    Authors: We thank the referee for identifying this omission. The reward for fine-grained classification is a binary verifiable function: reward = 1 if the final answer string exactly matches the ground-truth class label, and 0 otherwise. This is directly computable from the output without external judges. While the reward evaluates answer correctness, the Visual-RFT pipeline differs from SFT by sampling multiple reasoning+answer trajectories per image and optimizing via GRPO, which reinforces visual reasoning paths that lead to correct classifications. We will add the explicit mathematical definition and pseudocode for the classification reward (alongside the IoU formulation) in the revised Section 3 (a sketch of this reward appears after these responses). revision: yes

  2. Referee: [§4] §4 (experiments): no training-budget table, no wall-clock or token counts for the GRPO runs versus the SFT baselines, and no statistical significance tests (e.g., standard error over multiple seeds) are reported for the headline numbers (+24.3% classification, +21.9 COCO 2-shot). Without these controls it is impossible to rule out that the observed differences arise from longer effective optimization or variance rather than the visual reward.

    Authors: We agree that these controls are required for a rigorous comparison. In the revised manuscript we will insert a new table in Section 4 that reports training budgets (total tokens, wall-clock time, and optimization steps) for Visual-RFT versus SFT on every benchmark. We will also rerun the primary experiments across multiple random seeds and report means with standard errors to quantify statistical significance of the gains. revision: yes

  3. Referee: [§3.2] §3.2 (GRPO formulation): the paper adopts the standard GRPO objective without modification. The manuscript must isolate whether any performance increment survives when the visual reward is replaced by a non-visual answer-match reward; otherwise the central claim that 'visual perception verifiable reward functions' are the operative ingredient remains untested.

    Authors: We accept the need for this isolation experiment. In the revision we will add an ablation that replaces the task-specific visual rewards (IoU for detection/grounding, exact-match for classification) with a non-visual answer-match reward that only checks final-answer correctness. Performance differences will be reported to test whether the visual component of the reward is responsible for the observed gains. For spatial tasks the non-visual reward necessarily omits dense localization signals, but we will present the comparison explicitly. revision: yes
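
Pulling responses 1 and 3 together, a minimal sketch of the two reward variants the proposed ablation would compare. The names and data layout are illustrative, not the authors' code, and iou_reward refers to the sketch given earlier on this page:

    def classification_reward(answer: str, label: str) -> float:
        # Response 1's reward: 1.0 iff the final answer string exactly
        # matches the ground-truth class label (binary, verifiable).
        return 1.0 if answer == label else 0.0

    def visual_reward(task: str, response: dict, truth: dict) -> float:
        # Task-specific visual reward used by Visual-RFT: dense IoU for
        # spatial tasks, exact match for classification.
        if task in ("detection", "grounding"):
            return iou_reward(response["box"], truth["box"])
        return classification_reward(response["answer"], truth["label"])

    def answer_match_reward(response: dict, truth: dict) -> float:
        # Non-visual control for the ablation: final-answer correctness
        # only, discarding all localization signal.
        return 1.0 if response["answer"] == truth["answer"] else 0.0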

Circularity Check

0 steps flagged

No significant circularity in Visual-RFT derivation chain

full rationale

The paper's core method applies standard GRPO (from external prior work) to multiple LVLM-generated responses, using externally defined verifiable rewards such as IoU for detection and analogous task-specific functions for classification/grounding. These rewards are constructed from standard metrics independent of the model's fitted parameters or the target performance numbers. Reported gains (e.g., +24.3% on one-shot classification) are empirical benchmark results, not mathematical predictions or derivations that reduce to the inputs by construction. No self-definitional steps, fitted-input-as-prediction, or load-bearing self-citation chains appear in the described procedure. The derivation remains self-contained against external benchmarks and standard RL components.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that task-specific verifiable rewards can be defined without circularity and that GRPO updates remain stable when rewards are sparse or noisy; no free parameters are explicitly fitted in the abstract description.

axioms (1)
  • domain assumption Verifiable reward functions such as IoU can be computed reliably from model outputs and ground truth without additional learned components.
    Invoked when defining rewards for object detection and grounding tasks.

pith-pipeline@v0.9.0 · 5637 in / 1194 out tokens · 61460 ms · 2026-05-13T22:11:14.863124+00:00 · methodology


Forward citations

Cited by 25 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing

    cs.CV 2026-05 unverdicted novelty 7.0

    A planner-orchestrator system learns long-horizon image editing by maximizing outcome-based rewards from a vision-language judge and refining plans from successful trajectories.

  2. Pest-Thinker: Learning to Think and Reason like Entomologists via Reinforcement Learning

    cs.CV 2026-05 unverdicted novelty 7.0

    Pest-Thinker is a reinforcement learning framework that improves MLLMs' expert-level reasoning on pest morphology via synthesized CoT trajectories, GRPO optimization, and an LLM-judged feature reward on new benchmarks...

  3. CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding

    cs.CV 2026-04 unverdicted novelty 7.0

    CGC improves fine-grained multi-image understanding in MLLMs by constructing contrastive training instances from existing single-image annotations and adding a rule-based spatial reward, achieving SOTA on MIG-Bench an...

  4. Multi-modal Reasoning with LLMs for Visual Semantic Arithmetic

    cs.AI 2026-04 unverdicted novelty 7.0

    SAri-RFT applies GRPO-based reinforcement fine-tuning to LVLMs on novel two-term and three-term visual semantic arithmetic tasks, reaching SOTA on the new IRPD dataset and Visual7W-Telling.

  5. UIPress: Bringing Optical Token Compression to UI-to-Code Generation

    cs.CL 2026-04 unverdicted novelty 7.0

    UIPress is the first encoder-side learned optical compression method for UI-to-Code that compresses visual tokens to 256, outperforming the uncompressed baseline by 7.5% CLIP score and the best inference-time baseline...

  6. Understanding the Role of Hallucination in Reinforcement Post-Training of Multimodal Reasoning Models

    cs.LG 2026-04 unverdicted novelty 7.0

    RL post-training on hallucination-forced multimodal data improves reasoning performance and can outperform standard training.

  7. GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents

    cs.CV 2025-04 unverdicted novelty 7.0

    GUI-R1 uses reinforcement fine-tuning with GRPO on a small curated dataset to create a generalist vision-language action model that outperforms prior GUI agent methods across mobile, desktop, and web benchmarks using ...

  8. ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.

  9. RadThinking: A Dataset for Longitudinal Clinical Reasoning in Radiology

    cs.CV 2026-05 unverdicted novelty 6.0

    RadThinking releases a large longitudinal CT VQA dataset stratified into foundation perception questions, single-rule reasoning questions, and compositional multi-step chains grounded in clinical reporting standards f...

  10. Injecting Distributional Awareness into MLLMs via Reinforcement Learning for Deep Imbalanced Regression

    cs.CL 2026-05 unverdicted novelty 6.0

    A Group Relative Policy Optimization framework with concordance correlation coefficient rewards improves MLLM regression accuracy on long-tailed distributions, especially in medium- and few-shot regimes, without model...

  11. Injecting Distributional Awareness into MLLMs via Reinforcement Learning for Deep Imbalanced Regression

    cs.CL 2026-05 unverdicted novelty 6.0

    A plug-and-play RL method adds batch-level distributional supervision via CCC rewards to reduce regression-to-the-mean in MLLMs on imbalanced regression benchmarks.

  12. Rethinking Reinforcement Fine-Tuning in LVLM: Convergence, Reward Decomposition, and Generalization

    cs.LG 2026-04 unverdicted novelty 6.0

    Introduces TA-MDP and proves GRPO convergence at O(1/sqrt(T)), a reward decomposition bound, and PAC-Bayes generalization for tool-augmented LVLM policies.

  13. CLEAR: Unlocking Generative Potential for Degraded Image Understanding in Unified Multimodal Models

    cs.CV 2026-04 unverdicted novelty 6.0

    CLEAR uses degradation-aware fine-tuning, a latent representation bridge, and interleaved reinforcement learning to connect generative and reasoning capabilities in multimodal models for better degraded image understanding.

  14. Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning

    cs.CV 2026-04 unverdicted novelty 6.0

    RLER trains video-reasoning models with three task-driven RL rewards for evidence production and elects the best answer from a few candidates via evidence consistency scoring, yielding 6.3% average gains on eight benchmarks.

  15. EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs

    cs.CV 2026-04 unverdicted novelty 6.0

    EgoMind activates spatial cognition in MLLMs via linguistic Role-Play Caption and Progressive Spatial Analysis, reaching competitive results on VSI-Bench, SPAR-Bench, SITE-Bench and SPBench with only 5K SFT and 20K RL...

  16. VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

    cs.CV 2025-04 unverdicted novelty 6.0

    VLM-R1 applies R1-style RL using rule-based rewards on visual tasks with clear ground truth to achieve competitive performance and superior generalization over SFT in vision-language models.

  17. Perceptual Flow Network for Visually Grounded Reasoning

    cs.CV 2026-05 unverdicted novelty 5.0

    PFlowNet decouples perception from reasoning, integrates multi-dimensional rewards with vicinal geometric shaping via variational RL, and reports new SOTA results on V* Bench (90.6%) and MME-RealWorld-lite (67.0%).

  18. Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding

    cs.LG 2026-04 unverdicted novelty 5.0

    A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.

  19. HalluClear: Diagnosing, Evaluating and Mitigating Hallucinations in GUI Agents

    cs.AI 2026-04 unverdicted novelty 5.0

    HalluClear supplies a taxonomy, calibrated evaluation, and lightweight post-training mitigation that reduces hallucinations in GUI agents using only 9K samples.

  20. Towards Robust Endogenous Reasoning: Unifying Drift Adaptation in Non-Stationary Tuning

    cs.LG 2026-04 unverdicted novelty 5.0

    CPO++ adapts reinforcement fine-tuning of MLLMs to endogenous multi-modal concept drift through counterfactual reasoning and preference optimization, yielding better coherence and cross-domain robustness in safety-cri...

  21. SignReasoner: Compositional Reasoning for Complex Traffic Sign Understanding via Functional Structure Units

    cs.CV 2026-04 unverdicted novelty 5.0

    SignReasoner decomposes traffic signs into functional structure units and uses a two-stage VLM post-training pipeline to achieve state-of-the-art compositional reasoning on a new benchmark.

  22. SVSR: A Self-Verification and Self-Rectification Paradigm for Multimodal Reasoning

    cs.AI 2026-04 unverdicted novelty 5.0

    SVSR trains multimodal models to verify and correct their own reasoning using a preference dataset, supervised fine-tuning, and semi-online DPO with a teacher model.

  23. Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models

    cs.CV 2026-04 unverdicted novelty 5.0

    HDPO reframes tool efficiency as a conditional objective within accurate trajectories, enabling Metis to reduce tool invocations by orders of magnitude while raising reasoning accuracy.

  24. RCoT-Seg: Reinforced Chain-of-Thought for Video Reasoning and Segmentation

    cs.CV 2026-05 unverdicted novelty 4.0

    RCoT-Seg uses GRPO-reinforced keyframe selection from a CoT-start corpus followed by SAM2 mask propagation to improve video object segmentation under implicit temporal instructions over prior MLLM sampling methods.

  25. From System 1 to System 2: A Survey of Reasoning Large Language Models

    cs.AI 2025-02 accept novelty 3.0

    The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · cited by 24 Pith papers · 13 internal anchors

  1. [1]

    Lmrl gym: Benchmarks for multi-turn reinforcement learning with language models, 2023

    Marwa Abdulhai, Isadora White, Charlie Snell, Charles Sun, Joey Hong, Yuexiang Zhai, Kelvin Xu, and Sergey Levine. Lmrl gym: Benchmarks for multi-turn reinforcement learning with language models. arXiv preprint arXiv:2311.18232, 2023.

  2. [2]

    Internlm2 technical report

    Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, et al. Internlm2 technical report. arXiv preprint arXiv:2403.17297, 2024.

  3. [3]

    Grounding large language models in interactive environments with online reinforcement learning

    Thomas Carta, Clément Romac, Thomas Wolf, Sylvain Lamprier, Olivier Sigaud, and Pierre-Yves Oudeyer. Grounding large language models in interactive environments with online reinforcement learning. In ICLR, 2023.

  4. [4]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.

  5. [5]

    Lvis: A dataset for large vocabulary instance segmentation

    Agrim Gupta, Piotr Dollar, and Ross Girshick. Lvis: A dataset for large vocabulary instance segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5356–5364, 2019.

  6. [6]

    Qwen2.5-Coder Technical Report

    Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al. Qwen2.5-Coder technical report. arXiv preprint arXiv:2409.12186, 2024.

  7. [7]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. arXiv:2412.16720, 2024.

  8. [8]

    Preference optimization for reasoning with pseudo feedback

    Fangkai Jiao, Geyang Guo, Xingxing Zhang, Nancy F Chen, Shafiq Joty, and Furu Wei. Preference optimization for reasoning with pseudo feedback. arXiv preprint arXiv:2411.16345, 2024.

  9. [9]

    Segment anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4015–4026, 2023.

  10. [10]

    3d object representations for fine-grained categorization

    Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In ICCV workshops, 2013.

  11. [11]

    Lisa: Reasoning segmentation via large language model

    Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9579–9589, 2024.

  12. [12]

    Tulu 3: Pushing Frontiers in Open Language Model Post-Training

    Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, et al. Tülu 3: Pushing frontiers in open language model post-training. arXiv preprint arXiv:2411.15124, 2024.

  13. [13]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024.

  14. [14]

    Open-vocabulary semantic segmentation with mask-adapted clip

    Feng Liang, Bichen Wu, Xiaoliang Dai, Kunpeng Li, Yinan Zhao, Hang Zhang, Peizhao Zhang, Peter Vajda, and Diana Marculescu. Open-vocabulary semantic segmentation with mask-adapted clip. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7061–7070, 2023.

  15. [15]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.

  16. [16]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024.

  17. [17]

    Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs

    Chris Yuhao Liu, Liang Zeng, Jiacai Liu, Rui Yan, Jujie He, Chaojie Wang, Shuicheng Yan, Yang Liu, and Yahui Zhou. Skywork-Reward: Bag of tricks for reward modeling in llms. arXiv preprint arXiv:2410.18451, 2024.

  18. [18]

    Grounding dino: Marrying dino with grounded pre-training for open-set object detection

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In European Conference on Computer Vision, pages 38–55. Springer, 2024.

  19. [19]

    Mia-dpo: Multi-image augmented direct preference optimization for large vision-language models

    Ziyu Liu, Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Haodong Duan, Conghui He, Yuanjun Xiong, Dahua Lin, and Jiaqi Wang. Mia-dpo: Multi-image augmented direct preference optimization for large vision-language models. arXiv preprint arXiv:2410.17637, 2024.

  20. [20]

    Reft: Reasoning with reinforced fine-tuning, 2024

    Trung Quoc Luong, Xinbo Zhang, Zhanming Jie, Peng Sun, Xiaoran Jin, and Hang Li. Reft: Reasoning with reinforced fine-tuning, 2024.

  21. [21]

    Fine-Grained Visual Classification of Aircraft

    Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013.

  22. [22]

    Automated flower classification over a large number of classes

    Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In ICVGIP, 2008.

  23. [23]

    Hello gpt-4o, 2024

    OpenAI. Hello gpt-4o, 2024.

  24. [24]

    Openai o3-mini system card, 2025

    OpenAI. Openai o3-mini system card, 2025.

  25. [25]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. In NeurIPS, 2022.

  26. [26]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke E. Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Francis Christiano, Jan Leike, and Ryan J. Lowe. Training language models to follow instructions ...

  27. [27]

    Cats and dogs

    Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. Cats and dogs. In CVPR, 2012.

  28. [28]

    Is reinforcement learning (not) for natural language processing: Benchmarks, baselines, and building blocks for natural language policy optimization

    Rajkumar Ramamurthy, Prithviraj Ammanabrolu, Kianté Brantley, Jack Hessel, Rafet Sifa, Christian Bauckhage, Hannaneh Hajishirzi, and Yejin Choi. Is reinforcement learning (not) for natural language processing: Benchmarks, baselines, and building blocks for natural language policy optimization. In ICLR, 2023.

  29. [29]

    Glamm: Pixel grounding large multimodal model

    Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M Anwer, Eric Xing, Ming-Hsuan Yang, and Fahad S Khan. Glamm: Pixel grounding large multimodal model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13009–13018, 2024.

  30. [30]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv:1707.06347, 2017.

  31. [31]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

  32. [32]

    Offline RL for natural language generation with implicit language q learning

    Charlie Victor Snell, Ilya Kostrikov, Yi Su, Sherry Yang, and Sergey Levine. Offline RL for natural language generation with implicit language q learning. In ICLR, 2023.

  33. [33]

    Learning to summarize with human feedback

    Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. Learning to summarize with human feedback. In NeurIPS, 2022.

  34. [35]

    Aligning large multimodal models with factually augmented rlhf

    Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liang-Yan Gui, Yu-Xiong Wang, Yiming Yang, et al. Aligning large multimodal models with factually augmented rlhf. arXiv preprint arXiv:2309.14525, 2023.

  35. [36]

    Aligning large multimodal models with factually augmented rlhf

    Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liang-Yan Gui, Yu-Xiong Wang, Yiming Yang, et al. Aligning large multimodal models with factually augmented rlhf. In ACL, 2024.

  36. [37]

    Kimi k1.5: Scaling Reinforcement Learning with LLMs

    Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1.5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599, 2025.

  37. [38]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.

  38. [39]

    Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement

    An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, et al. Qwen2.5-Math technical report: Toward mathematical expert model via self-improvement. arXiv preprint arXiv:2409.12122, 2024.

  39. [40]

    React: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In ICLR, 2023.

  40. [41]

    Internlm-math: Open math large language models toward verifiable reasoning, 2024

    Huaiyuan Ying, Shuo Zhang, Linyang Li, Zhejian Zhou, Yunfan Shao, Zhaoye Fei, Yichuan Ma, Jiawei Hong, Kuikun Liu, Ziyi Wang, et al. Internlm-math: Open math large language models toward verifiable reasoning. arXiv preprint arXiv:2402.06332, 2024.

  41. [42]

    RLHF-V: Towards trustworthy mllms via behavior alignment from fine-grained correctional human feedback

    Tianyu Yu, Yuan Yao, Haoye Zhang, Taiwen He, Yifeng Han, Ganqu Cui, Jinyi Hu, Zhiyuan Liu, Hai-Tao Zheng, Maosong Sun, et al. RLHF-V: Towards trustworthy mllms via behavior alignment from fine-grained correctional human feedback. In CVPR, 2024.

  42. [43]

    RLAIF-V: Aligning mllms through open-source ai feedback for super gpt-4v trustworthiness

    Tianyu Yu, Haoye Zhang, Yuan Yao, Yunkai Dang, Da Chen, Xiaoman Lu, Ganqu Cui, Taiwen He, Zhiyuan Liu, Tat-Seng Chua, et al. RLAIF-V: Aligning mllms through open-source ai feedback for super gpt-4v trustworthiness. arXiv preprint arXiv:2405.17220, 2024.

  43. [44]

    Contextual object detection with multimodal large language models

    Yuhang Zang, Wei Li, Jun Han, Kaiyang Zhou, and Chen Change Loy. Contextual object detection with multimodal large language models. IJCV, 2024.

  44. [45]

    Internlm-xcomposer2.5-reward: A simple yet effective multi-modal reward model

    Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Ziyu Liu, Shengyuan Ding, Shenxi Wu, Yubo Ma, Haodong Duan, Wenwei Zhang, et al. Internlm-xcomposer2.5-reward: A simple yet effective multi-modal reward model. arXiv preprint arXiv:2501.12368, 2025.

  45. [46]

    Codedpo: Aligning code models with self generated and verified source code

    Kechi Zhang, Ge Li, Yihong Dong, Jingjing Xu, Jun Zhang, Jing Su, Yongfei Liu, and Zhi Jin. Codedpo: Aligning code models with self generated and verified source code. arXiv preprint arXiv:2410.05605, 2024.

  46. [47]

    Internlm-xcomposer-2.5: A versatile large vision language model supporting long-contextual input and output

    Pan Zhang, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Rui Qian, Lin Chen, Qipeng Guo, Haodong Duan, Bin Wang, Linke Ouyang, et al. Internlm-xcomposer-2.5: A versatile large vision language model supporting long-contextual input and output. arXiv preprint arXiv:2407.03320, 2024.

  47. [48]

    o1-coder: an o1 replication for coding

    Yuxiang Zhang, Shangxi Wu, Yuqi Yang, Jiangming Shu, Jinlin Xiao, Chao Kong, and Jitao Sang. o1-coder: an o1 replication for coding. arXiv preprint arXiv:2412.00154, 2024.

  48. [49]

    Beyond hallucinations: Enhancing lvlms through hallucination-aware direct preference optimization

    Zhiyuan Zhao, Bin Wang, Linke Ouyang, Xiaoyi Dong, Jiaqi Wang, and Conghui He. Beyond hallucinations: Enhancing lvlms through hallucination-aware direct preference optimization. arXiv preprint arXiv:2311.16839, 2023.

  49. [51]

    Aligning modalities in vision large language models via preference fine-tuning

    Yiyang Zhou, Chenhang Cui, Rafael Rafailov, Chelsea Finn, and Huaxiu Yao. Aligning modalities in vision large language models via preference fine-tuning. arXiv preprint arXiv:2402.11411, 2024.

  50. [52]

    Archer: Training language model agents via hierarchical multi-turn rl

    Yifei Zhou, Andrea Zanette, Jiayi Pan, Sergey Levine, and Aviral Kumar. Archer: Training language model agents via hierarchical multi-turn rl. In ICML, 2024.

  51. [53]

    Fine-Tuning Language Models from Human Preferences

    Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. arXiv:1909.08593, 2019.

  52. [54]

    Generalized decoding for pixel, image, and language

    Xueyan Zou, Zi-Yi Dou, Jianwei Yang, Zhe Gan, Linjie Li, Chunyuan Li, Xiyang Dai, Harkirat Behl, Jianfeng Wang, Lu Yuan, et al. Generalized decoding for pixel, image, and language. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15116–15127, 2023.